Ilias Diakonikolas Homepage

In every corner of machine learning and statistics, there is a need for estimators that work not just in an idealized model, but even when their assumptions are violated. Unfortunately, in high dimensions, being provably robust and being efficiently computable are often at odds with each other.

We give the first efficient algorithm for estimating the parameters of a high-dimensional Gaussian that is able to tolerate a constant fraction of corruptions that is independent of the dimension. Prior to our work, all known estimators either needed time exponential in the dimension to compute or could tolerate only an inverse-polynomial fraction of corruptions. Not only does our algorithm bridge the gap between robustness and algorithms, but also it turns out to be highly practical in a variety of settings.

Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

Estimating distributions from samples is a paradigmatic and fundamental unsupervised learning problem that has been studied in statistics since the late nineteenth century, starting with the pioneering work of Karl Pearson. During the past couple of decades, there has been a large body of work in computer science on this topic with a focus on {\em computational efficiency.}

The area of distribution estimation is well-motivated in its own right, and has seen a recent surge of research activity, in part due to the ubiquity of structured distributions in the natural and social sciences. Such structural properties of distributions are sometimes direct consequences of the underlying application problem, or they are a plausible explanation of the model under investigation.

In this chapter, we give a survey of both classical and modern techniques for distribution estimation, with a focus on recent algorithmic ideas developed in theoretical computer science. These ideas have led to computationally and statistically efficient algorithms for learning broad families of models. For the sake of concreteness, we illustrate these ideas with specific examples. Finally, we highlight outstanding challenges and research directions for future work.

Robust Learning of Multi-index Models via Iterative Subspace Approximation [abstract] [arxiv]
I. Diakonikolas, G. Iakovidis, D. M. Kane, N. Zarifis
Manuscript, 2025
We study the task of learning Multi-Index Models (MIMs) in the presence of label noise under the Gaussian distribution. A $K$-MIM on $\mathbb{R}^d$ is any function $f$ that only depends on a $K$-dimensional subspace, i.e., $f(\mathbf{x}) = g(\mathbf{W} \mathbf{x})$ for a link function $g$ on $\mathbb{R}^K$ and a $K\times d$ matrix $\mathbf{W}$. We consider a class of well-behaved MIMs with finite ranges that satisfy certain regularity properties. Our main contribution is a general noise-tolerant learning algorithm for this class whose complexity is qualitatively optimal in the Statistical Query (SQ) model. At a high-level, our algorithm attempts to iteratively construct better approximations to the defining subspace by computing low-degree moments of our function conditional on its projection to the subspace computed thus far, and adding directions with relatively large empirical moments. For well-behaved MIMs, we show that this procedure efficiently finds a subspace $V$ so that $f(\mathbf{x})$ is close to a function of the projection of $\mathbf{x}$ onto $V$, which can then be found by brute-force. Conversely, for functions for which these conditional moments do not necessarily help in finding better subspaces, we prove an SQ lower bound providing evidence that no efficient algorithm exists.
As concrete applications of our general algorithm, we provide significantly faster noise-tolerant learners for two well-studied concept classes:
* Multiclass Linear Classifiers: A multiclass linear classifier is any function $f: \mathbb{R}^d \to [K]$ of the form $f(\mathbf{x}) = \mathrm{argmax}_{I\in [K]}( \mathbf{w}^{(i)} \cdot \mathbf{x} + t_i)\;,$ where $\mathbf{w}^{(i)} \in \mathbb{R}^d$ and $t_i \in \mathbb{R}$. We give a {\em constant-factor} approximate agnostic learner for this class, i.e., an algorithm that achieves 0-1 error $O(\mathrm{OPT})+\epsilon$. Our algorithm has sample complexity $N = O(d) \, 2^{\mathrm{poly}(K/\epsilon)}$ and computational complexity $\mathrm{poly}(N)$. This is the first constant-factor agnostic learner for this class whose complexity is a fixed-degree polynomial in $d$. In the agnostic model, it was previously known that achieving error $\opt+\eps$ requires time $d^{\poly(1/\eps)}$, even for $K=2$. Perhaps surprisingly, we prove an SQ lower bound showing that achieving error $\opt+\eps$, for $\eps = 1/\poly(K)$, incurs complexity $d^{\Omega(K)}$ even for the simpler case of Random Classification Noise.
* Intersections of Halfspaces: An intersection of $K$ halfspaces is any function $f:\mathbb{R}^d\to\{\pm 1\}$ such that there exist $K$ halfspaces $h_i(\mathbf{x})$ with $f(\mathbf{x}) = 1$ if and only if $h_i(\mathbf{x}) = 1$ for all $i \in [K]$. We give an approximate agnostic learner for this class achieving 0-1 error $K \, \tilde{O}(\mathrm{OPT}) + \epsilon$. Our algorithm has sample complexity $N=O(d^2) \, 2^{\mathrm{poly}(K/\epsilon)}$ and computational complexity $\mathrm{poly}(N)$. This is the first agnostic learner for this class with {\em near-optimal} dependence on $\mathrm{OPT}$ in its error, whose complexity is a fixed-degree polynomial in $d$. Previous algorithms either achieved significantly worse error guarantees, or incurred $d^{\poly(1/\eps)}$ time (even for $K=2$).
Furthermore, we show that in the presence of random classification noise, the complexity of our algorithm is significantly better, scaling polynomially with $1/\epsilon$.

Implicit High-Order Moment Tensor Estimation and Learning Latent Variable Models [abstract] [arxiv]
I. Diakonikolas, D. Kane
Manuscript, 2024
We study the task of learning latent-variable models on $\R^d$ with $k$ hidden parameters. A common obstacle towards designing computationally efficient algorithms for such models is the necessity of approximating moment tensors of {\em super-constant} degree, scaling with the parameter $k$ or the desired accuracy $\eps>0$. Motivated by such learning applications, we develop a general efficient algorithm for {\em implicit moment tensor computation}. Roughly speaking, our algorithm computes in $\poly(d, k)$ time a succinct approximate description of tensors of the form $M_m = \sum_{i=1}^{k} w_i v_i^{\otimes m}$, for $w_i \in \R_+$---even for $m=\omega(1)$---assuming that there exists a {\em polynomial-size arithmetic circuit} whose expected output on an appropriate samplable distribution is equal to $M_m$, and whose covariance on this input is bounded. Our framework broadly generalizes, both conceptually and technically, the work of~\cite{LL21-opt} which developed an efficient algorithm for the specific moment tensors that arise in the task of clustering mixtures of spherical Gaussians.
By leveraging our general implicit moment estimation algorithm, we obtain the first $\poly(d, k)$-time learning algorithms for the following classical latent-variable models---thereby resolving or making significant progress towards a number of important open problems in the literature.
{\bf Mixtures of Linear Regressions} Given i.i.d.\ samples $(x, y)$ with $x \sim N(0, I)$ and such that the joint distribution on $(x, y)$ is an unknown $k$-mixture of linear regressions on $\R^{d+1}$ corrupted with Gaussian noise, the goal is to learn the underlying distribution in total variation distance. We give a $\poly(d, k, 1/\eps)$-time algorithm for this task, where $\eps$ is the desired error. The previously best algorithm has super-polynomial complexity in $k$.
{\bf Learning Mixtures of Spherical Gaussians} Given i.i.d.\ samples from a $k$-mixture of identity covariance Gaussians on $\R^d$, the goal is to learn the target mixture. We give a $\poly(d, k, 1/\eps)$-time density estimation algorithm, where $\eps$ is the desired total variation error, under the condition that the means lie in a ball of radius $O(\sqrt{\log k})$. Prior algorithms incur super-polynomial complexity in $k$. For parameter estimation, we give a $\poly(d, k, 1/\eps)$-time algorithm where $\eps$ is the target accuracy, under the {\em optimal} mean separation of $\Omega(\log^{1/2}(k/\eps))$ and the condition that the largest distance is comparable to the smallest. Prior polynomial-time parameter estimation algorithms require separation $\Omega(\log^{1/2+c}(k/\eps))$, for $c>0$.
{\bf PAC Learning Positive Linear Combinations of ReLUs} Given labeled examples $(x, y)$ with $x \sim N(0, I)$ and $y = F(x)$, where $F$ is an unknown positive linear combination of $k$ ReLUs (or other reasonable activations) on $\R^d$, the goal is to PAC learn the target function in $L_2$-norm. We give an algorithm for this task with complexity $\poly(d, k) 2^{\poly(1/\eps)}$, where $\eps$ is the desired relative $L_2$-error. Our is the first algorithm for this learning task that runs in fully-polynomial (i.e., $\poly(d, k)$) time for {\em subconstant} values of the error $\eps = o_{k, d}(1)$. All previously known algorithms have super-polynomial complexity, for any $\eps = o_{k, d}(1)$.

Learning Intersections of Two Margin Halfspaces under Factorizable Distributions
I. Diakonikolas, M. Ma, L. Ren, C. Tzamos
Proceedings of the 38th Annual Conference on Learning Theory (COLT 2025)

Robustly Learning Monotone Generalized Linear Models via Data Augmentation [abstract] [arxiv]
N. Zarifis, P. Wang, I. Diakonikolas, J. Diakonikolas
Proceedings of the 38th Annual Conference on Learning Theory (COLT 2025)
We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for {\em any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm~[Kakade et al. 2011]. Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$---a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.

Faster Algorithms for Agnostically Learning Disjunctions and their Implications [abstract] [arxiv]
I. Diakonikolas, D. Kane, L. Ren
Proceedings of the 38th Annual Conference on Learning Theory (COLT 2025)
Mark Fulk Best Student Paper Award at COLT 2025
We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.

On Fine-Grained Distinct Element Estimation [abstract] [arxiv]
I. Diakonikolas, D. Kane, J. Lee, T. Pittas, D. Woodruff, S. Zhou
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\epsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\epsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\epsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\epsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions n the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for dist with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.

On Learning Parallel Pancakes with Mostly Uniform Weights [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Karmalkar, J. Lee, T. Pittas
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
Selected for Spotlight Presentation
We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\R^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.

Batch List-Decodable Linear Regression via Higher Moments [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Karmalkar, S. Liu, T. Pittas
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
We study the task of list-decodable linear regression using batches. A batch is called clean if the points it contains are i.i.d. samples from an unknown linear regression distribution. For a parameter $\alpha \in (0, 1/2)$, an unknown $\alpha$-fraction of the batches are clean and no assumptions are made on the remaining batches. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. \cite{DasJKS22} gave an efficient algorithm for this task, under natural distributional assumptions, with the following guarantee. Under the assumption that the batch size $n$ satisfies $n \geq \tilde{\Omega}(\alpha^{-1})$ and the number of batches is $m = \poly(d, n, 1/\alpha)$, their algorithm runs in polynomial time and outputs a list of $O(1/\alpha^2)$ vectors at least one of which is $\tilde{O}(\alpha^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial-time algorithm for this task with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $\delta>0$, as long as the batch size is $n \geq \Omega_{\delta}(\alpha^{-\delta})$ and the degree-$\Theta(1/\delta)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \poly((dn)^{1/\delta}, 1/\alpha)$ batches, runs in polynomial-time, and outputs an $O(1/\alpha)$-sized list of vectors one of which is $O(\alpha^{-\delta/2}/\sqrt{n})$ close to the target. That is, under our distributional assumptions, our algorithm achieves substantially smaller minimum batch size and final error, while achieving the optimal list size. Our approach leverages higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure for this setting. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.

Efficient Multivariate Robust Mean Estimation Under Mean-Shift Contamination [abstract] [arxiv]
I. Diakonikolas, G. Iakovidis, D. Kane, T. Pittas
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
We study the algorithmic problem of robust mean estimation of an identity covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\R^d$ generated i.i.d. via the following process. For a parameter $\alpha<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-\alpha$, $x_i$ is drawn from $N(\mu, I)$, where $\mu \in \R^d$ is the target mean; and with probability $\alpha$, $x_i$ is drawn from $N(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that--- in contrast to Huber contamination--- in the presence of mean-shift contamination consistent estimation is possible. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to {\em any} desired accuracy. Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.

Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise [abstract] [arxiv]
I. Diakonikolas, M. Ma, L. Ren, C. Tzamos
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
Selected for Oral Presentation
We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $\R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma: = \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove {\em super-polynomial} Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving {\em any} constant factor approximation to the optimal loss or even beating the trivial hypothesis.

Online Linear Classification with Massart Noise [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
We study the task of online learning in the presence of Massart noise. Specifically, instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\mathbf{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\mathbf{x}$ with unknown probability {\em at most} $\eta$. We focus on the fundamental class of $\gamma$-margin linear classifiers and present the first computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$. We point out that the mistake bound achieved by our algorithm is qualitatively tight for computationally efficient algorithms; this follows from the fact that, even in the offline setting, achieving 0-1 error better than $\eta$ requires super-polynomial time under standard complexity assumptions.
We extend our online learning model to a $k$-arm contextual bandit setting where the rewards---instead of satisfying commonly used realizability assumptions---are consistent, in expectation, with some linear ranking function with weight vector $\mathbf{w}^\ast$. Given a list of contexts $\mathbf{x}_1,\ldots \mathbf{x}_k$, if $\mathbf{w}^*\cdot \mathbf{x}_i > \mathbf{w}^* \cdot \mathbf{x}_j$, the expected reward of action $i$ must be larger than that of $j$ by at least $\Delta$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)~ \Delta T - o(T)$ bigger than choosing a random action at every round.

Entangled Mean Estimation in High-Dimensions [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Liu, T. Pittas
Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC 2025)
We study the task of high-dimensional entangled mean estimation in the subset-of-signals model. Specifically, given $N$ independent random points $x_1,\ldots,x_N$ in $\R^D$ and a parameter $\alpha \in (0, 1)$ such that each $x_i$ is drawn from a Gaussian with mean $\mu$ and unknown covariance, and an unknown $\alpha$-fraction of the points have {\em identity-bounded} covariances, the goal is to estimate the common mean $\mu$. The one-dimensional version of this task has received significant attention in theoretical computer science and statistics over the past decades. Recent work~\cite{LiaYua20, compton2024near} has given near-optimal upper and lower bounds for the one-dimensional setting. On the other hand, our understanding of even the information-theoretic aspects of the multivariate setting has remained limited.
In this work, we design a computationally efficient algorithm achieving an information-theoretically near-optimal error. Specifically, we show that the optimal error (up to polylogarithmic factors) is $f(\alpha,N) + \sqrt{D/(\alpha N)}$, where the term $f(\alpha,N)$ is the error of the one-dimensional problem and the second term is the sub-Gaussian error rate. Our algorithmic approach employs an iterative refinement strategy, whereby we progressively learn more accurate approximations $\hat \mu$ to $\mu$. This is achieved via a novel rejection sampling procedure that removes points significantly deviating from $\hat \mu$, as an attempt to filter out unusually noisy samples. A complication that arises is that rejection sampling introduces bias in the distribution of the remaining points. To address this issue, we perform a careful analysis of the bias, develop an iterative dimension-reduction strategy, and employ a novel subroutine inspired by list-decodable learning that leverages the one-dimensional result.

SoS Certificates for Sparse Singular Values and Their Applications: Robust Statistics, Subspace Distortion, and More [abstract] [arxiv]
I. Diakonikolas, S. Hopkins A. Pensia, S. Tiegel
Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC 2025)
We study \emph{sparse singular value certificates} for random rectangular matrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries, we give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most $\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic primitive lies at the heart of a wide range of problems across algorithmic statistics and theoretical computer science, including robust mean and covariance estimation, certification of distortion of random subspaces of $\R^n$, certification of the $2 \rightarrow p$ norm of a random matrix, and sparse principal component analysis.
Our algorithms certify a bound which is asymptotically smaller than the naive one, given by the maximum singular value of $M$, for nearly the widest-possible range of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of $n,d$ and $\eta$ which is larger by {\em any} polynomial factor than what is achieved by our algorithm would violate lower bounds in the statistical query and low-degree polynomials models. Our certification algorithm makes essential use of the Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop a new combinatorial connection between the \emph{graph matrix} approach to analyze random matrices with dependent entries, and the Efron-Stein decomposition of functions of independent random variables.
As applications of our certification algorithm, we obtain new efficient algorithms for a wide range of well-studied algorithmic tasks. In algorithmic robust statistics, we obtain new algorithms for robust mean and covariance estimation with tradeoffs between breakdown point and sample complexity, which are nearly matched by statistical query and low-degree polynomial lower bounds (that we establish). We also obtain new polynomial-time guarantees for certification of $\ell_1/\ell_2$ distortion of random subspaces of $\R^n$ (also with nearly matching lower bounds), sparse principal component analysis, and certification of the $2\rightarrow p$ norm of a random matrix.

SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications [abstract] [arxiv]
I. Diakonikolas, S. Hopkins A. Pensia, S. Tiegel
Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC 2025)
Invited to SIAM Journal on Computing Special Issue for STOC 2025
We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is \emph{SoS-certifiably subgaussian}---a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.

A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise [abstract] [arxiv]
I. Diakonikolas, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2024)
Selected for Spotlight Presentation
We study the problem of PAC learning $\gamma$-margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be $\widetilde{\Theta}(1/(\gamma^2 \eps))$. Prior computationally efficient algorithms for the problem incur sample complexity $\tilde{O}(1/(\gamma^4 \eps^3))$ and achieve 0-1 error of $\eta+\eps$, where $\eta<1/2$ is the upper bound on the noise rate. Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on $1/\eps$ is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity $\widetilde{\Theta}(1/(\gamma^2 \eps^2))$, nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.

Reliable Learning of Halfspaces under Gaussian Marginals [abstract] [arxiv]
I. Diakonikolas, L. Ren, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2024)
Selected for Spotlight Presentation
We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the others. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on $\R^d$ with sample and computational complexity $$d^{O(\log (\min\{1/\alpha, 1/\eps\}))}\min (2^{\log(1/\eps)^{O(\log (1/\alpha))}},2^{\poly(1/\eps)})\;,$$ where $\eps$ is the excess error and $\alpha$ is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the $d^{\Omega(\log (1/\alpha))}$ dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.

Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [abstract] [arxiv]
P. Wang, N. Zarifis, I. Diakonikolas, J. Diakonikolas
Advances in Neural Information Processing Systems (NeurIPS 2024)
A single-index model (SIM) is a function of the form $\sigma(\vec w^{\ast} \cdot \vec x)$, where $\sigma: \R \to \R$ is a known link function and $\vec{w}^{\ast}$ is a hidden unit vector. We study the task of learning SIMs in the agnostic (a.k.a. adversarial label noise) model with respect to the $L^2_2$-loss under the Gaussian distribution. Our main result is a sample and computationally efficient agnostic proper learner that attains $L^2_2$-error of $O(\opt)+\eps$, where $\opt$ is the optimal loss. The sample complexity of our algorithm is $\tilde{O}(d^{\lceil k^{\ast}/2\rceil}+d/\eps)$, where $k^{\ast}$ is the information-exponent of $\sigma$ corresponding to the degree of its first non-zero Hermite coefficient. This sample bound nearly matches known CSQ lower bounds, even in the realizable setting. Prior algorithmic work in this setting had focused on learning in the realizable case or in the presence of semi-random noise. Prior computationally efficient robust learners required significantly stronger assumptions on the link function.

Active Learning of General Halfspaces: Label Queries vs Membership Queries [abstract] [arxiv]
I. Diakonikolas, D. Kane, M. Ma
Advances in Neural Information Processing Systems (NeurIPS 2024)
We study the problem of learning {\em general} (i.e., not necessarily homogeneous) halfspaces under the Gaussian distribution on $\R^d$ in the presence of some form of query access. In the classical pool-based active learning model, where the algorithm is allowed to make adaptive label queries to previously sampled points, we establish a strong information-theoretic lower bound ruling out non-trivial improvements over the passive setting. Specifically, we show that any active learner requires label complexity of $\tilde{\Omega}(d/(\log(m)\eps))$, where $m$ is the number of unlabeled examples. Specifically, to beat the passive label complexity of $\tilde{O}(d/\eps)$, an active learner requires a pool of $2^{\poly(d)}$ unlabeled samples. On the positive side, we show that this lower bound can be circumvented with membership query access, even in the agnostic model. Specifically, we give a computationally efficient learner with query complexity of $\tilde{O}(\min\{1/p, 1/\eps\} + d\cdot\polylog(1/\eps))$ achieving error guarantee of $O(\opt)+\eps$. Here $p \in [0, 1/2]$ is the bias and $\opt$ is the 0-1 loss of the optimal halfspace. As a corollary, we obtain a strong separation between the active and membership query models. Taken together, our results characterize the complexity of learning general halfspaces under Gaussian marginals in these models.

Learning a Single Neuron Robustly to Distributional Shifts and Adversarial Label Noise [abstract] [arxiv]
S. Li, S. Karmalkar, I. Diakonikolas, J. Diakonikolas
Advances in Neural Information Processing Systems (NeurIPS 2024)
We study the problem of learning a single neuron with respect to the $L_2^2$-loss in the presence of adversarial distribution shifts, where the labels can be arbitrary, and the goal is to find a ``best-fit'' function. More precisely, given training samples from a reference distribution $\mathcal{p}_0$, the goal is to approximate the vector $\mathbf{w}^*$ which minimizes the squared loss with respect to the worst-case distribution that is close in $\chi^2$-divergence to $\mathcal{p}_{0}$.
We design a computationally efficient algorithm that recovers a vector $\hat{\mathbf{w}}$ satisfying $\mathbb{E}_{\mathcal{p}^*} (\sigma(\hat{\mathbf{w}} \cdot \mathbf{x}) - y)^2 \leq C \, \mathbb{E}_{\mathcal{p}^*} (\sigma(\mathbf{w}^* \cdot \mathbf{x}) - y)^2 + \epsilon$, where $C>1$ is a dimension-independent constant and $(\mathbf{w}^*, \mathcal{p}^*)$ is the witness attaining the min-max risk $\min_{\mathbf{w}~:~\|\mathbf{w}\| \leq W} \max_{\mathcal{p}} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{p}} (\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2 - \nu \chi^2(\mathcal{p}, \mathcal{p}_0)$. Our algorithm follows a primal-dual framework and is designed by directly bounding the risk with respect to the original, nonconvex $L_2^2$ loss. From an optimization standpoint, our work opens new avenues for the design of primal-dual algorithms under structured nonconvexity.

Sum-of-Squares Lower Bounds for Non-Gaussian Component Analysis [abstract] [arxiv]
I. Diakonikolas, S. Karmalkar, S. Pang, A. Potechin
Proceedings of the 65th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2024)
Non-Gaussian Component Analysis (NGCA) is the statistical task of finding a non-Gaussian direction in a high-dimensional dataset. Specifically, given i.i.d. samples from a distribution $P^A_{v}$ on $\mathbb{R}^n$ that behaves like a known distribution $A$ in a hidden direction $v$ and like a standard Gaussian in the orthogonal complement, the goal is to approximate the hidden direction. The standard formulation posits that the first $k-1$ moments of $A$ match those of the standard Gaussian and the $k$-th moment differs. Under mild assumptions, this problem has sample complexity $O(n)$. On the other hand, all known efficient algorithms require $\Omega(n^{k/2})$ samples. Prior work developed sharp Statistical Query and low-degree testing lower bounds suggesting an information-computation tradeoff for this problem.
Here we study the complexity of NGCA in the Sum-of-Squares (SoS) framework. Our main contribution is the first super-constant degree SoS lower bound for NGCA. Specifically, we show that if the non-Gaussian distribution $A$ matches the first $(k-1)$ moments of $\mathcal{N}(0, 1)$ and satisfies other mild conditions, then with fewer than $n^{(1 - \eps)k/2}$ many samples from the normal distribution, with high probability, degree $(\log n)^{{1\over 2}-o_n(1)}$ SoS fails to refute the existence of such a direction $v$. Our result significantly strengthens prior work by establishing a super-polynomial information-computation tradeoff against a broader family of algorithms. As corollaries, we obtain SoS lower bounds for several problems in robust statistics and the learning of mixture models.
Our SoS lower bound proof introduces a novel technique, that we believe may be of broader interest, and a number of refinements over existing methods. As in previous work, we use the framework of [Barak et al. FOCS 2016], where we express the moment matrix $M$ as a sum of graph matrices, find a factorization $M\approx LQL^T$ using minimum vertex separators, and show that with high probability $Q$ is positive semidefinite (PSD) while the errors are small. Our technical innovations are as follows. First, instead of the minimum weight separator used in prior work, we crucially make use of the minimum square separator. Second, proving that $Q$ is PSD poses significant challenges due to an intrinsic reason. In all prior work, the major part of $Q$ was always a constant term, meaning a matrix whose entries are constant functions of the input. Here, however, even after removing a small error term, $Q$ remains a nontrivial linear combination of non-constant, equally dominating terms. We develop an algebraic method to address this difficulty, which may have wider applications. Specifically, we model the multiplications between the ``important'' graph matrices by an $\mathbb R$-algebra, construct a representation of this algebra, and use it to analyze $Q$. Via this approach, we show that the PSDness of $Q$ boils down to the multiplicative identities of Hermite polynomials.

Agnostically Learning Multi-index Models with Queries [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 65th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2024)
Invited to SIAM Journal on Computing Special Issue for FOCS 2024
We study the power of query access for the fundamental task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels of the examples and the goal is to compute a hypothesis that is competitive with the {\em best-fit} function in a known class, i.e., it achieves error $\opt+\eps$, where $\opt$ is the error of the best function in the class. We focus on a general family of Multi-Index Models (MIMs), which are $d$-variate functions that depend only on few relevant directions, i.e., have the form $g(\vec W \x)$ for an unknown link function $g$ and a $k \times d$ matrix $\vec W$. Multi-index models cover a wide range of commonly studied function classes, including real-valued function classes such as constant-depth neural networks with ReLU activations, and Boolean concept classes such as intersections of halfspaces.
Our main result shows that query access gives significant runtime improvements over random examples for agnostically learning both real-valued and Boolean-valued MIMs. Under standard regularity assumptions for the link function (namely, bounded variation or surface area), we give an agnostic query learner for MIMs with running time $O(k)^{\poly(1/\eps)} \; \poly(d) $. In contrast, algorithms that rely only on random labeled examples inherently require $d^{\poly(1/\epsilon)}$ samples and runtime, even for the basic problem of agnostically learning a single ReLU or a halfspace. As special cases of our general approach, we obtain the following results:
For the class of depth-$\ell$, width-$S$ ReLU networks on $\R^d$, our agnostic query learner runs in time $\poly(d) 2^{\poly(\ell S/\eps)}$. This bound qualitatively matches the runtime of an algorithm by~\cite{CKM22} for the realizable PAC setting with random examples.
For the class of arbitrary intersections of $k$ halfspaces on $\R^d$, our agnostic query learner runs in time $\poly(d) \, 2^{\poly(\log (k)/\eps)}$. Prior to our work, no improvement over the agnostic PAC model complexity (without queries) was known, even for the case of a single halfspace. \end{itemize} In both these settings, we provide evidence that the $2^{\poly(1/\eps)}$ runtime dependence is required for proper query learners, even for agnostically learning a single ReLU or halfspace.
In summary, our algorithmic result establishes a strong computational separation between the agnostic PAC and the agnostic PAC+Query models under the Gaussian distribution. Prior to our work, no such separation was known for any natural concept classes — even for the special case of agnostically learning a single halfspace, for which it was an open problem first posed by Feldman~\cite{Feldman08}. Our results are enabled by a general dimension-reduction technique that leverages query access to estimate gradients of (a smoothed version of) the underlying label function.

Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation [abstract] [arxiv]
I. Diakonikolas, D. Kane, J. Lee, T. Pittas
Proceedings of the 36th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2025)
We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge \alpha$ for some known parameter $\alpha$, and each $P_i$ has unknown covariance $\Sigma_i \preceq \sigma^2_i \cdot I_d$ for some unknown $\sigma_i$, the goal is to cluster the samples assuming a pairwise mean separation in the order of $(\sigma_i+\sigma_j)/\sqrt{\alpha}$ between every pair of components $P_i$ and $P_j$. Our main contributions are as follows:
For the special case of nearly uniform mixtures, we give the first polynomial-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e.~$\max_i \sigma_i$)~\cite{diakonikolas2022clustering} or required both additional structural assumptions and mean separation scaling as a large degree polynomial in $1/\alpha$~\cite{BKK22}.
For arbitrary (i.e.~general-weight) mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a {\em clustering refinement} --- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth --- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the ``no large sub-cluster'' condition introduced in prior work~\cite{BKK22}, we show that our algorithm will output an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to a fraction of adversarial outliers comparable to $\alpha$.
At the technical level, our algorithm proceeds by first using list-decodable mean estimation to generate a polynomial-size list of possible cluster means, before successively pruning candidates using a carefully constructed convex program. In particular, the convex program takes as input a candidate mean $\hat{\mu}$ and a scale parameter $\hat{s}$, and determines the existence of a subset of points that could plausibly form a cluster with scale $\hat{s}$ centered around $\hat{\mu}$. While the natural way of designing this program makes it non-convex, we construct a convex relaxation which remains satisfiable by (and only by) not-too-small subsets of true clusters.

Fast Co-Training under Weak Dependence via Stream-Based Active Learning [abstract] [proceedings]
I. Diakonikolas, M. Ma, L. Ren, C. Tzamos
Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
Selected for Oral Presentation
Co-training is a classical semi-supervised learning method which only requires a small number of labeled examples for learning, under reasonable assumptions. Despite extensive literature on the topic, very few hypothesis classes are known to be provably efficiently learnable via co-training, even under very strong distributional assumptions. In this work, we study the co-training problem in the stream-based active learning model. We show that a range of natural concept classes are efficiently learnable via co-training, in terms of both label efficiency and computational efficiency.
We provide an efficient reduction of co-training under the standard assumption of weak dependence, in the stream-based active model, to online classification. As a corollary, we obtain efficient co-training algorithms with {\em error independent} label complexity for every concept class class efficiently learnable in the mistake bound online model. Our framework also gives co-training algorithms with label complexity $\tilde{O}(d\log (1/\eps))$ for any concept class with VC dimension $d$, though in general this reduction is not computationally efficient. Finally, using additional ideas from online learning, we design the first efficient co-training algorithms with label complexity $\tilde{O}(d^2\log (1/\eps))$ for several concept classes, including unions of intervals and homogeneous halfspaces.

Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Karmalkar, A. Pensia, T. Pittas
Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\R^d$ with corruption rate $\eps>0$, our algorithm has sample complexity $(k^2/\eps^2)\polylog(d/\eps)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\eps)$. Previous efficient algorithms inherently incur error $\Omega(\eps \sqrt{\log(1/\eps)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.

Robustly Learning Single-Index Models via Alignment Sharpness [abstract] [arxiv]
N. Zarifis, P. Wang, I. Diakonikolas, J. Diakonikolas
Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
We study the problem of learning Single-Index Models under the $L_2^2$ loss in the agnostic model. We give an efficient learning algorithm, achieving a constant factor approximation to the optimal loss, that succeeds under a range of distributions (including log-concave distributions) and a broad class of monotone and Lipschitz link functions. This is the first efficient constant factor approximate agnostic learner, even for Gaussian data and for any nontrivial class of link functions. Prior work for the case of unknown link function either works in the realizable setting or does not attain constant factor approximation. The main technical ingredient enabling our algorithm and analysis is a novel notion of a local error bound in optimization that we term {\em alignment sharpness} and that may be of broader interest.

Testable Learning of General Halfspaces with Adversarial Label Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Liu, N. Zarifis
Proceedings of the 37th Annual Conference on Learning Theory (COLT 2024)
We study the task of testable learning of general--- not necessarily homogeneous---halfspaces with adversarial label noise with respect to the Gaussian distribution. In the testable learning framework, the goal is to develop a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our main result is the first polynomial time tester-learner for general halfspaces that achieves dimension-independent misclassification error. At the heart of our approach is a new methodology to reduce testable learning of general halfspaces to testable learning of nearly homogeneous halfspaces that may be of broader interest.

Statistical Query Lower Bounds for Learning Truncated Gaussians [abstract] [arxiv]
I. Diakonikolas, D. Kane, T. Pittas, N. Zarifis
Proceedings of the 37th Annual Conference on Learning Theory (COLT 2024)
We study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\cal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \R^d$, we are given access to samples from the distribution $N(\mathbf{\mu}, \vec I)$ truncated to the set $S$. The goal is to estimate $\mathbf{\mu}$ within accuracy $\eps>0$ in $\ell_2$-norm. Our main result is a Statistical Query (SQ) lower bound suggesting a super-polynomial information-computation gap for this task. In more detail, we show that the complexity of any SQ algorithm for this problem is $d^{\poly(1/\eps)}$, even when the class $\cal{C}$ is simple so that $\poly(d/\eps)$ samples information-theoretically suffice. Concretely, our SQ lower bound applies when $\cal{C}$ is a union of a bounded number of rectangles whose VC dimension and Gaussian surface are small. As a corollary of our construction, it also follows that the complexity of the previously known algorithm for this task is qualitatively best possible.

Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [abstract] [arxiv]
I. Diakonikolas, D. Kane *Author order randomized
Proceedings of the 37th Annual Conference on Learning Theory (COLT 2024)
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\R^d$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\eps)^{O(k)}$, where $\eps>0$ is the target accuracy. Prior work had given an algorithm for this problem with complexity $(dk/\eps)^{h(k)}$, where the function $h(k)$ scales super-polynomially in $k$. Interestingly, the complexity of our algorithm is near-optimal within the class of Correlational Statistical Query algorithms. At a high-level, our algorithm uses tensor decomposition to identify a subspace such that all the $O(k)$-order moments are small in the orthogonal directions. Its analysis makes essential use of the theory of Schur polynomials to show that the higher-moment error tensors are small given that the lower-order ones are.

Super Non-singular Decompositions of Polynomials and their Application to Robustly Learning Low-degree PTFs [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, S. Liu, N. Zarifis
Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC 2024)
We study the efficient learnability of low-degree polynomial threshold functions (PTFs) in the presence of a constant fraction of adversarial corruptions. Our main algorithmic result is a polynomial-time PAC learning algorithm for this concept class in the strong contamination model under the Gaussian distribution with error guarantee $O_{d, c}(\opt^{1-c})$, for any desired constant $c>0$, where $\opt$ is the fraction of corruptions. In the strong contamination model, an omniscient adversary can arbitrarily corrupt an $\opt$-fraction of the data points and their labels. This model generalizes the malicious noise model and the adversarial label noise model. Prior to our work, known polynomial-time algorithms in this corruption model (or even in the weaker adversarial label noise model) achieved error $\tilde{O}_d(\opt^{1/(d+1)})$, which deteriorates significantly as a function of the degree $d$.
Our algorithm employs an iterative approach inspired by localization techniques previously used in the context of learning linear threshold functions. Specifically, we use a robust perceptron algorithm to compute a good partial classifier and then iterate on the unclassified points. In order to achieve this, we need to take a set defined by a number of polynomial inequalities and partition it into several well-behaved subsets. To this end, we develop new polynomial decomposition techniques that may be of independent interest.

Testing Closeness of Multivariate Distributions via Ramsey Theory [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Liu
Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC 2024)
We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{\mathcal A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{\mathcal A_k}$ denotes the generalized $\mathcal A_k$ distance between $\mathbf p$ and $\mathbf q$ --- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound.
In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $\mathcal A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles.
Both our algorithm and our lower bound make essential use of tools from Ramsey theory.

How Does Wild Data Provably Help OOD Detection? [abstract] [arxiv]
X. Du, Z. Fang, I. Diakonikolas, Y. Li
Proceedings of the 12th International Conference on Learning Representations (ICLR 2024)
Using unlabeled data to regularize the machine learning models has demonstrated promise for improving safety and reliability in detecting out-of-distribution (OOD) data. Harnessing the power of unlabeled in-the-wild data is non-trivial due to the heterogeneity of both in-distribution (ID) and OOD data. This lack of a clean set of OOD samples poses significant challenges in learning an optimal OOD classifier. Currently, there is a lack of research on formally understanding how unlabeled data helps OOD detection. This paper bridges the gap by introducing a new learning framework that offers both strong theoretical guarantees and empirical effectiveness. The framework separates candidate outliers from the unlabeled data and then trains an OOD classifier using the candidate outliers and the labeled ID data. Theoretically, we provide rigorous error bounds from the lens of separability and learnability, formally justifying the two components in our algorithm. Our theory shows that our framework can separate the candidate outliers with small error rates, which leads to a generalization guarantee for the learned OOD classifier. Empirically, our method achieves state-of-the-art performance on common benchmarks, reinforcing our theoretical insights.

Online Robust Mean Estimation [abstract] [arxiv]
D. Kane, I. Diakonikolas, H. Xiao, S. Liu *Author order randomized
Proceedings of the 35th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2024)
We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t=1,2,\ldots,T$, the $i^{th}$ sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $\mu_t$ for the true mean value of the process at time $t$. We assume that most of the sensors observe independent samples from some common distribution $X$, but an $\eps$-fraction of them may instead behave maliciously. The algorithm wishes to compute a good approximation $\mu$ to the true mean $\mu^\ast := \mathbf{E}[X]$. We note that if the algorithm is allowed to wait until time $T$ to report its estimate, this reduces to the well-studied problem of robust mean estimation. However, the requirement that our algorithm produces partial estimates as the data is coming in substantially complicates the situation.
We prove two main results about online robust mean estimation in this model. First, if the uncorrupted samples satisfy the standard condition of $(\eps,\delta)$-stability, we give an efficient online algorithm that outputs estimates $\mu_t$, $t \in [T],$ such that with high probability it holds that $\|\mu-\mu^\ast\|_2 = O(\delta \log(T))$, where $\mu = (\mu_t)_{t \in [T]}$. We note that this error bound is nearly competitive with the best offline algorithms, which would achieve $\ell_2$-error of $O(\delta)$. Our second main result shows that with additional assumptions on the input (most notably that $X$ is a product distribution) there are inefficient algorithms whose error does not depend on $T$ at all.

First Order Stochastic Optimization with Oblivious Noise [abstract] [arxiv]
I. Diakonikolas, S. Karmalkar, J. Park, C. Tzamos
Advances in Neural Information Processing Systems (NeurIPS 2023)
We initiate the study of stochastic optimization with oblivious noise, broadly generalizing the standard heavy-tailed noise setup. In our setting, in addition to random observation noise, the stochastic gradient may be subject to independent \emph{oblivious noise}, which may not have bounded moments and is not necessarily centered. Specifically, we assume access to a noisy oracle for the stochastic gradient of $f$ at $x$, which returns a vector $\nabla f(\gamma, x) + \xi$, where $\gamma$ is the bounded variance observation noise and $\xi$ is the oblivious noise that is independent of $\gamma$ and $x$. The only assumption we make on the oblivious noise $\xi$ is that $\Pr[\xi = 0] \ge \alpha$ for some $\alpha \in (0, 1)$. In this setting, it is not information-theoretically possible to recover a single solution close to the target when the fraction of inliers $\alpha$ is less than $1/2$. Our main result is an efficient {\em list-decodable} learner that recovers a small list of candidates at least one of which is close to the true solution. On the other hand, if $\alpha = 1-\eps$, where $0< \eps< 1/2$ is sufficiently small constant, the algorithm recovers a single solution. Along the way, we develop a rejection-sampling-based algorithm to perform noisy location estimation, which may be of independent interest.

SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions [abstract] [arxiv]
I. Diakonikolas, D. Kane, L. Ren, Y. Sun
Advances in Neural Information Processing Systems (NeurIPS 2023)
We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a methodology to prove SQ lower bounds for NGCA that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with a standard Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only. Our SQ lower bound result can be generalized to the setting where the hidden subspace is multi-dimensional.

SQ Lower Bounds for Learning Mixtures of Linear Classifiers [abstract] [arxiv]
I. Diakonikolas, D. Kane, Y. Sun
Advances in Neural Information Processing Systems (NeurIPS 2023)
We study the problem of learning mixtures of linear classifiers under Gaussian covariates. Given sample access to a mixture of $r$ distributions on $\R^n$ of the form $(\mathbf{x},y_{\ell})$, $\ell \in [r]$, where $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$ and $y_\ell=\mathrm{sgn}(\langle\mathbf{v}_\ell,\mathbf{x}\rangle)$ for an unknown unit vector $\mathbf{v}_\ell$, the goal is to learn the underlying distribution in total variation distance. Our main result is a Statistical Query (SQ) lower bound suggesting that known algorithms for this problem are essentially best possible, even for the special case of uniform mixtures. In particular, we show that the complexity of any SQ algorithm for the problem is $n^{\poly(1/\Delta) \log(r)}$, where $\Delta$ is a lower bound on the pairwise $\ell_2$-separation between the $\mathbf{v}_\ell$'s. The key technical ingredient underlying our result is a new construction of spherical designs that may be of independent interest.

Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia, T. Pittas
Advances in Neural Information Processing Systems (NeurIPS 2023)
We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both these problems. Specifically, for Gaussian robust mean estimation on $\R^d$ with contamination parameter $\eps \in (0, \eps_0)$ for a small absolute constant $\eps_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\eps^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\eps)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\eps^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\eps)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.

Robust Second-Order Nonconvex Optimization and Its Application to Low-Rank Matrix Sensing [abstract] [arxiv]
S. Li, Y. Cheng, I. Diakonikolas, J. Diakonikolas, R. Ge, S. Wright
Advances in Neural Information Processing Systems (NeurIPS 2023)
Finding an approximate second-order stationary point (SOSP) is a fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of the datapoints are arbitrarily corrupted. We develop an efficient algorithm that (under mild assumptions) outputs an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}(\frac{D^2}{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints. As our main application, we show that low-rank matrix sensing can be robustly solved in our framework, allowing for corruptions in both the sensing matrices and the measurements. On the lower bound side, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.

Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [abstract] [arxiv]
I. Diakonikolas, J. Diakonikolas, D. Kane, P. Wang, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2023)
We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces with Random Classification Noise under the Gaussian distribution. We establish nearly-matching algorithmic and Statistical Query (SQ) lower bound results revealing a surprising information-computation gap for this basic problem. Specifically, the sample complexity of this learning problem is $\widetilde{\Theta}(d/\eps)$, where $d$ is the dimension and $\eps$ is the excess error. Our positive result is a computationally efficient learning algorithm with sample complexity $\tilde{O}(d/\eps + d/(\max\{p, \eps\})^2)$, where $p$ quantifies the bias of the target halfspace. On the lower bound side, we show that any efficient SQ algorithm (or low-degree test) for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \eps\})^2)$. Our lower bound suggests that this quadratic dependence on $1/\eps$ is inherent for efficient algorithms.

Efficient Testable Learning of Halfspaces with Adversarial Label Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, S. Liu, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2023)
We give the first polynomial-time algorithm for the testable learning of halfspaces in the presence of adversarial label noise under the Gaussian distribution. In the recently introduced testable learning model, one is required to produce a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our tester-learner runs in time $\poly(d/\eps)$ and outputs a halfspace with misclassification error $O(\opt)+\eps$, where $\opt$ is the 0-1 error of the best fitting halfspace. At a technical level, our algorithm employs an iterative soft localization technique enhanced with appropriate testers to ensure that the data distribution is sufficiently similar to a Gaussian.

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm [abstract] [arxiv]
I. Diakonikolas, D. Kane, J. Lee, A. Pensia, T. Pittas
Advances in Neural Information Processing Systems (NeurIPS 2023)
Selected for Spotlight Presentation
We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\R^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(\mu, \Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\poly(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\poly(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) --- a key ingredient in the recent work of~\cite{BakDJKKV22} on robustly learning arbitrary GMMs. Combined with the other components of~\cite{BakDJKKV22}, our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs, resolving an open problem proposed by Vempala and Kothari. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.

Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions [abstract] [arxiv]
I. Diakonikolas, S. Karmalkar, J. Park, C. Tzamos
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \eps$, where $\xi$ is the oblivious noise drawn independently of $x$ and satisfies $\Pr[\xi = 0] \geq o(1)$, and $\eps \sim N(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \eps = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.

Self-Directed Linear Classification [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
In online classification, a learner is presented with a sequence of examples and aims to predict their labels in an online fashion so as to minimize the total number of mistakes. In the self-directed variant, the learner knows in advance the pool of examples and can adaptively choose the order in which predictions are made. Here we study the power of choosing the prediction order and establish the first strong separation between worst-order and random-order learning for the fundamental task of linear classification. Prior to our work, such a separation was known only for very restricted concept classes, e.g., one-dimensional thresholds or axis-aligned rectangles. We present two main results. If $X$ is a dataset of $n$ points drawn uniformly at random from the $d$-dimensional unit sphere, we design an efficient self-directed learner that makes $O(d \log \log(n))$ mistakes and classifies the entire dataset. If $X$ is an arbitrary $d$-dimensional dataset of size $n$, we design an efficient self-directed learner that predicts the labels of $99\%$ of the points in $X$ with mistake bound independent of $n$. In contrast, under a worst- or random-ordering, the number of mistakes must be at least $\Omega(d \log n)$, even when the points are drawn uniformly from the unit sphere and the learner only needs to predict the labels for $1\%$ of them.

Information-Computation Tradeoffs for Learning Margin Halfspaces with Random Classification Noise [abstract] [arxiv]
I. Diakonikolas, J. Diakonikolas, D. Kane, P. Wang, N. Zarifis
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
We study the problem of PAC learning $\gamma$-margin halfspaces with Random Classification Noise. We establish an information-computation tradeoff suggesting an inherent gap between the sample complexity of the problem and the sample complexity of computationally efficient algorithms. Concretely, the sample complexity of the problem is $\widetilde{\Theta}(1/(\gamma^2 \eps))$. We start by giving a simple efficient algorithm with sample complexity $\widetilde{O}(1/(\gamma^2 \eps^2))$. Our main result is a lower bound for Statistical Query (SQ) algorithms and low-degree polynomial tests suggesting that the quadratic dependence on $1/\eps$ in the sample complexity is inherent for computationally efficient algorithms. Specifically, our results imply a lower bound of $\widetilde{\Omega}(1/(\gamma^{1/2} \eps^2))$ on the sample complexity of any efficient SQ learner or low-degree test.

SQ Lower Bounds for Learning Mixtures of Separated and Bounded Covariance Gaussians [abstract] [arxiv]
I. Diakonikolas, D. Kane, T. Pittas, N. Zarifis
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
We study the complexity of learning mixtures of separated Gaussians with common unknown bounded covariance matrix. Specifically, we focus on learning Gaussian mixture models (GMMs) on $\R^d$ of the form $P= \sum_{i=1}^k w_i N(\mu_i,\Sigma_i)$, where $\Sigma_i = \Sigma \preceq I$ and $\min_{i \neq j} \|\mu_i -\mu_j\|_2 \geq k^\eps$ for some $\eps>0$. Known learning algorithms for this family of GMMs have complexity $(dk)^{O(1/\eps)}$. In this work, we prove that any Statistical Query (SQ) algorithm for this problem requires complexity at least $d^{\Omega(1/\eps)}$. In the special case where the separation is on the order of $k^{1/2}$, we additionally obtain fine-grained SQ lower bounds with the correct exponent. Our SQ lower bounds imply similar lower bounds for low-degree polynomial tests. Conceptually, our results provide evidence that known algorithms for this problem are nearly best possible.

Statistical and Computational Limits for Tensor-on-Tensor Association Detection [abstract] [conference]
I. Diakonikolas, D. Kane, Y. Luo, A. Zhang
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
In this paper, we consider the tensor-on-tensor association detection problem, where the goal is to detect whether there is an association between the tensor responses to tensor covariates linked via a low-rank tensor parameter. We first develop tight bounds on the signal-to-noise ratio (SNR) such that the detection problem is statistically possible. We then provide testing procedures that succeed when the SNR is above the threshold. On the other hand, the statistical optimal tests often require computing the largest singular value of a given tensor, which can be NP-hard in general. To complement that, we develop efficient polynomial-time testing procedures with provable guarantees. We also develop matching lower bounds under the Statistical Query model and show that the SNRs required by the proposed polynomial-time algorithms are essential for computational efficiency. We identify a gap that appears between the SNR requirements of the optimal unconstrained-time tests and polynomial-time tests if and only if the sum of the tensor response order and the tensor covariate order is no less than three. To our best knowledge, this is the first complete characterization of the statistical and computational limits for the general tensor-on-tensor association detection problem. Our findings significantly generalize the results in the literature on signal detection in linear regression and low-rank matrix trace regression.

A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points [abstract] [arxiv]
D. Kane, I. Diakonikolas *Author order randomized
Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023)
We prove that for $c>0$ a sufficiently small universal constant that a random set of $cd^2/\log^4(d)$ independent Gaussian random points in $\R^d$ lie on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.

Robustly Learning a Single Neuron via Sharpness [abstract] [arxiv]
P. Wang, N. Zarifis, I. Diakonikolas, J. Diakonikolas
Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
Selected for Oral Presentation
We study the problem of learning a single neuron with respect to the $L_2^2$-loss in the presence of adversarial label noise. We give an efficient algorithm that, for a broad family of activations including ReLUs, approximates the optimal $L_2^2$-error within a constant factor. Our algorithm applies under much milder distributional assumptions compared to prior work. The key ingredient enabling our results is a novel connection to local error bounds from optimization theory.

Near-Optimal Cryptographic Hardness of Agnostically Learning Halfspaces and ReLU Regression under Gaussian Marginals [abstract] [arxiv]
I. Diakonikolas, D. Kane, L. Ren
Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
We study the task of agnostically learning halfspaces under the Gaussian distribution. Specifically, given labeled examples $(\mathbf{x},y)$ from an unknown distribution on $\R^n \times \{ \pm 1\}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with 0-1 loss $\opt+\eps$, where $\opt$ is the 0-1 loss of the best-fitting halfspace. We prove a near-optimal computational hardness result for this task, under the widely believed sub-exponential time hardness of the Learning with Errors (LWE) problem. Prior hardness results are either qualitatively suboptimal or apply to restricted families of algorithms. Our techniques extend to yield near-optimal lower bounds for related problems, including ReLU regression.

Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia, T. Pittas
Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
We study principal component analysis (PCA), where given a dataset in $\R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly-linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.

A Strongly Polynomial Algorithm for Approximate Forster Transforms and its Application to Halfspace Learning [abstract] [arxiv]
I. Diakonikolas, C. Tzamos, D. Kane *Author order randomized
Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023)
The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our main result is the first {\em strongly polynomial time} algorithm to compute an approximate Forster transform of a given dataset or certify that no such transformation exists. By leveraging our strongly polynomial Forster algorithm, we obtain the first strongly polynomial time algorithm for {\em distribution-free} PAC learning of halfspaces. This learning result is surprising because {\em proper} PAC learning of halfspaces is {\em equivalent} to linear programming. Our learning approach extends to give a strongly polynomial halfspace learner in the presence of random classification noise and, more generally, Massart noise.

Gaussian Mean Testing Made Simple [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia
SIAM Symposium on Simplicity in Algorithms (SOSA 2023)
We study the following fundamental hypothesis testing problem, which we term {\em Gaussian mean testing}. Given i.i.d. samples from a distribution $p$ on $\R^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $N(0,I_d)$, and (ii) $p$ is a Gaussian $N(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \R^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.

Cryptographic Hardness of Learning Halfspaces with Massart Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, P. Manurangsi, L. Ren
Advances in Neural Information Processing Systems (NeurIPS 2022)
We study the complexity of PAC learning halfspaces in the presence of Massart noise. In this problem, we are given i.i.d.\ labeled examples $(\mathbf{x}, y) \in \R^N \times \{ \pm 1\}$, where the distribution of $\mathbf{x}$ is arbitrary and the label $y$ is a Massart corruption of $f(\mathbf{x})$, for an unknown halfspace $f: \R^N \to \{ \pm 1\}$, with flipping probability $\eta(\mathbf{x}) \leq \eta < 1/2$. The goal of the learner is to compute a hypothesis with small 0-1 error. Our main result is the first computational hardness result for this learning problem. Specifically, assuming the (widely believed) subexponential-time hardness of the Learning with Errors (LWE) problem, we show that no polynomial-time Massart halfspace learner can achieve error better than $\Omega(\eta)$, even if the optimal 0-1 error is small, namely $\opt = 2^{-\log^{c} (N)}$ for any universal constant $c \in (0, 1)$. Prior work had provided qualitatively similar evidence of hardness in the Statistical Query model. Our computational hardness result essentially resolves the polynomial PAC learnability of Massart halfspaces, by showing that known efficient learning algorithms for the problem are nearly best possible.

SQ Lower Bounds for Learning Single Neurons with Massart Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, L. Ren, Y. Sun
Advances in Neural Information Processing Systems (NeurIPS 2022)
We study the problem of PAC learning a single neuron in the presence of Massart noise. Specifically, for a known activation function $f: \R \to \R$, the learner is given access to labeled examples $(\mathbf{x}, y) \in \R^d \times \R$, where the marginal distribution of $\mathbf{x}$ is arbitrary and the corresponding label $y$ is a Massart corruption of $f(\langle \mathbf{w}, \mathbf{x} \rangle)$. The goal of the learner is to output a hypothesis $h: \R^d \to \R$ with small squared loss. For a range of activation functions, including ReLUs, we establish super-polynomial Statistical Query (SQ) lower bounds for this learning problem. In more detail, we prove that no efficient SQ algorithm can approximate the optimal error within any constant factor. Our main technical contribution is a novel SQ-hard construction for learning $\{ \pm 1\}$-weight Massart halfspaces on the Boolean hypercube that is interesting on its own right.

Nearly-Tight Bounds for Testing Histogram Distributions [abstract] [arxiv]
C. Canonne, I. Diakonikolas, D. Kane, S. Liu
Advances in Neural Information Processing Systems (NeurIPS 2022)
We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, {\em $k$-histograms} over $[n]$, are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\eps$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde \Theta (\sqrt{nk} / \eps + k / \eps^2 + \sqrt{n} / \eps^2)$.

Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions [abstract] [arxiv]
I. Diakonikolas, D. Kane, J. Lee, A. Pensia
Advances in Neural Information Processing Systems (NeurIPS 2022)
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an {\em additive} $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.

List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Karmalkar, A. Pensia, T. Pittas
Advances in Neural Information Processing Systems (NeurIPS 2022)
Selected for Oral Presentation
We study the problem of list-decodable \emph{sparse} mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d.\ samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ``certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\poly(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.

Outlier-Robust Sparse Estimation via Non-Convex Optimization [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, R. Ge, S. Gupta, D. Kane, M. Soltanolkotabi
Advances in Neural Information Processing Systems (NeurIPS 2022)
We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that {\em any} approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks. The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.

Learning General Halfspaces with Adversarial Label Noise via Online Gradient Descent [abstract] [proceedings]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
We study the problem of learning general, i.e., not necessarily homogeneous, halfspaces with adversarial label noise under the Gaussian distribution. Prior work has provided a sophisticated polynomial-time algorithm for this problem. In this work, we show that the problem can be solved directly via online gradient descent applied to a sequence of natural non-convex surrogates. This approach yields a simple iterative learning algorithm for general halfspaces with near-optimal sample complexity, runtime, and error guarantee. At the conceptual level, our work establishes an intriguing connection between learning halfspaces with adversarial noise and online optimization that may find other applications.

Streaming Algorithms for High-Dimensional Robust Statistics [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia, T. Pittas
Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022)
We study the fundamental problem of learning a single neuron, i.e., a function of the form $\x \mapsto \sigma(w \cdot x)$ for monotone activations $\sigma:\R \mapsto \R$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(x, y) \in \R^d \times \R$ such that there exists $w^\ast \in \R^d$ achieving $F(w^\ast) = \eps$, where $F(w) = \mathbf{E}_{(\x,y) \sim D}[(\sigma(w \cdot x) - y)^2]$. The goal of the learner is to output a hypothesis vector $\wt{w}$ such that $F(\wt{w}) = C \, \eps$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:
For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\wt{O}(d/\eps)$, which is tight within polylogarithmic factors.
For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\wt{O}(d \, \polylog(1/\eps))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\eps)$.
In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.

Robust Sparse Mean Estimation via Sum of Squares [abstract] [arxiv]
I. Diakonikolas, D. Kane, S. Karmalkar, A. Pensia, T. Pittas
Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022)
We study the problem of high-dimensional sparse mean estimation in the presence of an $\eps$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\R^d$ with ``certifiably bounded'' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\eps^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\eps^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\eps)$ with sample complexity $m = O(k^4 \polylog(d))/\eps^2$. Our algorithms follow the Sum-of-Squares based, proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.

Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models [abstract] [arxiv]
I. Diakonikolas, D. Kane, Y. Sun
Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022)
We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\eps$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\eps \sqrt{\log(1/\eps)})$. Similarly, we show that no efficient SQ algorithm with access to an $\eps$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\eps \log(1/\eps))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low-dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.

Non-Gaussian Component Analysis via Lattice Basis Reduction [abstract] [arxiv]
I. Diakonikolas, D. Kane
Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022)
Non-Gaussian Component Analysis (NGCA) is the following distribution learning problem: Given i.i.d. samples from a distribution on $\R^d$ that is non-gaussian in a hidden direction $v$ and an independent standard Gaussian in the orthogonal directions, the goal is to approximate the hidden direction $v$. Prior work~\cite{DKS17-sq} provided formal evidence for the existence of an information-computation tradeoff for NGCA under appropriate moment-matching conditions on the univariate non-gaussian distribution $A$. The latter result does not apply when the distribution $A$ is discrete. A natural question is whether information-computation tradeoffs persist in this setting. In this paper, we answer this question in the negative by obtaining a sample and computationally efficient algorithm for NGCA in the regime that $A$ is discrete or nearly discrete, in a well-defined technical sense. The key tool leveraged in our algorithm is the LLL method~\cite{LLL82} for lattice basis reduction.

Near-Optimal Statistical Query Hardness of Learning Halfspaces with Massart Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane
Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022)
We study the problem of PAC learning halfspaces with Massart noise. Given labeled samples $(x, y)$ from a distribution $D$ on $\R^{d} \times \{ \pm 1\}$ such that the marginal $D_x$ on the examples is arbitrary and the label $y$ of example $x$ is generated from the target halfspace corrupted by a Massart adversary with flipping probability $\eta(x) \leq \eta \leq 1/2$, the goal is to compute a hypothesis with small misclassification error. The best known $\poly(d, 1/\eps)$-time algorithms for this problem achieve error of $\eta+\eps$, which can be far from the optimal bound of $\opt+\eps$, where $\opt = \mathbf{E}_{x \sim D_x} [\eta(x)]$. While it is known that achieving $\opt+o(1)$ error requires super-polynomial time in the Statistical Query model, a large gap remains between known upper and lower bounds.
In this work, we essentially characterize the efficient learnability of Massart halfspaces in the Statistical Query (SQ) model. Specifically, we show that no efficient SQ algorithm for learning Massart halfspaces on $\R^d$ can achieve error better than $\Omega(\eta)$, even if $\opt = 2^{-\log^{c} (d)}$, for any universal constant $c \in (0, 1)$. Furthermore, when the noise upper bound $\eta$ is close to $1/2$, our error lower bound becomes $\eta - o_{\eta}(1)$, where the $o_{\eta}(1)$ term goes to $0$ when $\eta$ approaches $1/2$. Our results provide strong evidence that known learning algorithms for Massart halfspaces are nearly best possible, thereby resolving a longstanding open problem in learning theory.

Learning General Halfspaces with General Massart Noise under the Gaussian Distribution [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC 2022)
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $\eta(\mathbf{x}) \leq \eta$, for some parameter $\eta \in [0,1/2]$. The goal is to find a hypothesis with misclassification error of $\mathrm{OPT} + \eps$, where $\mathrm{OPT}$ is the error of the target halfspace. This problem had been previously studied under two assumptions: (i) the target halfspace is {\em homogeneous} (i.e., the separating hyperplane goes through the origin), and (ii) the parameter $\eta$ is {\em strictly} smaller than $1/2$. Prior to this work, no nontrivial bounds were known when either of these assumptions is removed. We study the general problem and establish the following:
For $\eta <1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $d^{O_{\eta}(\log(1/\gamma))}\mathrm{poly}(1/\epsilon)$, where $\gamma := \max\{\epsilon, \min\{\mathbf{Pr}[f(\mathbf{x}) = 1], \mathbf{Pr}[f(\mathbf{x}) = -1]\} \}$ is the ``bias'' of the target halfspace $f$. Prior efficient algorithms could only handle the special case of $\gamma = 1/2$. Interestingly, we establish a qualitatively matching lower bound of $d^{\Omega(\log(1/\gamma))}$ on the complexity of any Statistical Query (SQ) algorithm.
For $\eta = 1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $O_\epsilon(1) \,d^{O(\log(1/\epsilon))}$. This result is new even for the subclass of homogeneous halfspaces; prior algorithms for homogeneous Massart halfspaces provide vacuous guarantees for $\eta=1/2$. We complement our upper bound with a nearly-matching SQ lower bound of $d^{\Omega(\log(1/\epsilon) )}$, which holds even for the special case of homogeneous halfspaces.
Taken together, our results qualitatively characterize the complexity of learning general halfspaces with general Massart noise under Gaussian marginals. Our techniques rely on determining the existence (or non-existence) of low-degree polynomials whose expectations distinguish Massart halfspaces from random noise.

Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [abstract] [arxiv]
I. Diakonikolas, D. M. Kane, D. Kongsgaard, J. Li, K. Tian
Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC 2022)
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction of the points are arbitrary. The goal is to output a small list of vectors at least one of which is close to the mean of $\mathcal{D}$. As our main contribution, we develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $n^{1 + o(1)} d$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. As a corollary, we obtain the first {\em almost-linear time} algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of \cite{DiakonikolasKS18, DiakonikolasKK20}, providing a method to simultaneously cluster and downsample points using {\em one-dimensional} projections --- thus, bypassing the $k$-PCA subroutines required by prior algorithms.

Robustly Learning Mixtures of k Arbitrary Gaussians [abstract] [arxiv]
A. Bakshi, I. Diakonikolas, H. Jia, D. Kane, P. Kothari, S. Vempala
Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC 2022)
We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\R^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mixture of TV-distance separated Gaussians, and (c) a uniform mixture of two Gaussians. Our main tools are an efficient \emph{partial clustering} algorithm that relies on the sum-of-squares method, and a novel tensor decomposition algorithm that allows errors in both Frobenius norm and low-rank terms.

Hardness of Learning a Single Neuron with Adversarial Label Noise [abstract] [proceedings]
I. Diakonikolas, D. Kane, P. Manurangsi, L. Ren
Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022)
Selected for Oral Presentation
We study the problem of distribution-free learning of a single neuron under adversarial label noise with respect to the squared loss. For a wide range of activation functions, including ReLUs and sigmoids, we prove hardness of learning results in the Statistical Query model and under a well-studied assumption on the complexity of refuting XOR formulas. Specifically, we establish that no polynomial-time learning algorithm, even improper, can approximate the optimal loss value within any constant factor.

Efficient Approximation Algorithms for the Inverse Semivalue Problem [abstract] [proceedings]
I. Diakonikolas, C. Pavlou, J. Peebles, A. Stewart
Proceedings of the 21st International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2022)
Weighted voting games are typically used to model situations where a number of agents vote against or for a proposal. In such games, a proposal is accepted if a weighted sum of the votes exceeds a specified threshold. As the influence of a player over the outcome is not in general proportional to her assigned weight, various power indices have been proposed to measure each player's influence. The inverse power index problem is the problem of designing a weighted voting game that achieves a set of desirable influences as they are measured by a predefined power index. Recent work has shown that exactly solving the inverse power index problem is computationally intractable when the power index is in the class of semivalues --- a broad class that includes the popular Shapley value and Banzhaf index. In this work, we design efficient approximation algorithms for the inverse semivalue problem. We develop a unified methodology that leads to computationally efficient algorithms that solve the inverse semivalue problem to any desired accuracy. We perform an extensive experimental evaluation of our algorithms on both synthetic and real inputs. Our experiments show that our algorithms are scalable and achieve higher accuracy compared to previous methods in the literature.

Forster Decomposition and Learning Halfspaces with Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, C. Tzamos
Advances in Neural Information Processing Systems (NeurIPS 2021)
Selected for Spotlight Presentation
A Forster transform is an operation that turns a distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.

ReLU Regression with Massart Noise [abstract] [arxiv]
I. Diakonikolas, J. Park, C. Tzamos
Advances in Neural Information Processing Systems (NeurIPS 2021)
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.

Statistical Query Lower Bounds for List-Decodable Linear Regression [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia, T. Pittas, A. Stewart
Advances in Neural Information Processing Systems (NeurIPS 2021)
Selected for Spotlight Presentation
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \R^d \times \R$ and a parameter $0< \alpha <1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d.\ samples from a linear regression model with Gaussian covariates, and the remaining $(1-\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\poly(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.

List-Decodable Mean Estimation in Nearly-PCA Time [abstract] [arxiv]
I. Diakonikolas, D. Kane, D. Kongsgaard, J. Li, K. Tian
Advances in Neural Information Processing Systems (NeurIPS 2021)
Selected for Spotlight Presentation
Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust {\em list-decodable learning} focuses on the more challenging regime where only a minority $1/k$ fraction of the dataset is drawn from the distribution of interest, for some $k \ge 2$, and no assumptions are made on the remaining data. In this paper, we study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new list-decodable mean estimation algorithm for bounded covariance distributions with optimal sample complexity and error rate, running in {\em nearly-PCA time}. Specifically, assuming the ground truth distribution on $\R^d$ has covariance bounded by the identity, our algorithm outputs a list of $O(k)$ candidate means, one of which is within distance $O(\sqrt{k})$ from the true mean. Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset. We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee. This runtime matches up to logarithmic factors the cost of performing a single $k$-PCA on the data, which is a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest list-decodable mean estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$~\cite{DiakonikolasKK20}, and $\widetilde{O}(nd k^C)$ \cite{CherapanamjeriMY20} for an unspecified constant $C \geq 6$.
Our approach builds on a novel soft downweighting method, which is arguably the simplest known polynomial-time mean estimation technique in the list-decodable learning setting. To develop our fast algorithms, we boost the computational cost of our simpler algorithm via a careful ``win-win-win'' analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which we believe may be of independent interest.

Agnostic Proper Learning of Halfspaces under Gaussian Marginals [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021)
We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other non-linear activations, yielding in particular the first proper agnostic algorithm for ReLU regression.

The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals [abstract] [arxiv]
I. Diakonikolas, D. Kane, T. Pittas, N. Zarifis
Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021)
We study the problem of agnostic learning under the Gaussian distribution. We develop a method for finding hard families of examples for a wide class of problems by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-regression algorithm is essentially best possible, and therefore that the computational difficulty of agnostically learning a concept class is closely related to the polynomial degree required to approximate any function from the class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids.

Boosting in the Presence of Massart Noise [abstract] [arxiv]
I. Diakonikolas, R. Impagliazzo, D. M. Kane, R. Lei, J. Sorrell, C. Tzamos
Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021)
We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $\eta(x) \leq \eta$, where $\eta<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $\eta<1/2$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.

Outlier-Robust Learning of Ising Models Under Dobrushin's Condition [abstract] [arxiv]
I. Diakonikolas, D. M. Kane, A. Stewart, Y. Sun
Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021)
We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-$2$ polynomials of Ising models that may be of independent interest.

The Sample Complexity of Robust Covariance Testing [abstract] [arxiv]
I. Diakonikolas, D. Kane
Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021)
We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber's contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-\eps) X + \eps B$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, \Sigma)$, $B$ is a fixed but unknown noise distribution, and $\eps>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $\Sigma$ is the identity matrix versus $\gamma$-far from the identity in Frobenius norm.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $\Omega(d^2)$ for $\eps$ an arbitrarily small constant and $\gamma = 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation.

Learning Online Algorithms with Distributional Advice [abstract] [proceedings]
I. Diakonikolas, V. Kontonis, C. Tzamos, A. Vakilian, N. Zarifis
Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
We study the problem of designing online algorithms given advice about the input. While prior work had focused on deterministic advice, we only assume distributional access to the instances of interest, and the goal is to learn a competitive algorithm given access to i.i.d. samples. We aim to be competitive against an adversary with prior knowledge of the distribution, while also performing well against worst-case inputs.
We focus on the classical online problems of ski-rental and prophet-inequalities, and provide sample complexity bounds for the underlying learning tasks. First, we point out that for general distributions it is information-theoretically impossible to beat the worst-case competitive-ratio with any finite sample size. As our main contribution, we establish strong positive results for well-behaved distributions. Specifically, for the broad class of log-concave distributions, we show that $\mathrm{poly}(1/\epsilon)$ samples suffice to obtain $(1+\epsilon)$-competitive ratio. Finally, we show that this sample upper bound is close to best possible, even for very simple classes of distributions.

A Polynomial Time Algorithm for Learning Halfspaces with Tsybakov Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 53rd Annual ACM Symposium on Theory of Computing (STOC 2021)
We study the problem of PAC learning homogeneous halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, the label of every sample is independently flipped with an adversarially controlled probability that can be arbitrarily close to $1/2$ for a fraction of the samples. {\em We give the first polynomial-time algorithm for this fundamental learning problem.} Our algorithm learns the true halfspace within any desired accuracy $\eps$ and succeeds under a broad family of well-behaved distributions including log-concave distributions. Prior to our work, the only previous algorithm for this problem required quasi-polynomial runtime in $1/\eps$.
Our algorithm employs a recently developed reduction~\cite{DKTZ20b} from learning to certifying the non-optimality of a candidate halfspace. This prior work developed a quasi-polynomial time certificate algorithm based on polynomial regression. {\em The main technical contribution of the current paper is the first polynomial-time certificate algorithm.} Starting from a non-trivial warm-start, our algorithm performs a novel ``win-win'' iterative process which, at each step, either finds a valid certificate or improves the angle between the current halfspace and the true one. Our warm-start algorithm for isotropic log-concave distributions involves a number of analytic tools that may be of broader interest. These include a new efficient method for reweighting the distribution in order to recenter it and a novel characterization of the spectrum of the degree-$2$ Chow parameters.

Learning Halfspaces with Tsybakov Noise [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 53rd Annual ACM Symposium on Theory of Computing (STOC 2021)
(Conference version to be merged with paper above.)
We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flipping probabilities to be arbitrarily close to $1/2$ for a fraction of the samples.
Our main result is the first non-trivial PAC learning algorithm for this problem under a broad family of structured distributions --- satisfying certain concentration and (anti-)anti-concentration properties --- including log-concave distributions. Specifically, we given an algorithm that achieves misclassification error $\eps$ with respect to the true halfspace, with quasi-polynomial runtime dependence in $1/\eps$. The only previous upper bound for this problem --- even for the special case of log-concave distributions --- was doubly exponential in $1/\eps$ (and follows via the naive reduction to agnostic learning).
Our approach relies on a novel computationally efficient procedure to certify whether a candidate solution is near-optimal, based on semi-definite programming. We use this certificate procedure as a black-box and turn it into an efficient learning algorithm by searching over the space of halfspaces via online convex optimization.

Optimal Testing of Discrete Distributions with High Probability [abstract] [arxiv]
I. Diakonikolas, T. Gouleakis, D. Kane, J. Peebles, E. Price
Proceedings of the 53rd Annual ACM Symposium on Theory of Computing (STOC 2021)
We study the problem of testing discrete distributions with a focus on the high probability regime. Specifically, given samples from one or more discrete distributions, a property $\mathcal{P}$, and parameters $0< \eps, \delta <1$, we want to distinguish {\em with probability at least $1-\delta$} whether these distributions satisfy $\mathcal{P}$ or are $\eps$-far from $\mathcal{P}$ in total variation distance. Most prior work in distribution testing studied the constant confidence case (corresponding to $\delta = \Omega(1)$), and provided sample-optimal testers for a range of properties. While one can always boost the confidence probability of any such tester by black-box amplification, this generic boosting method typically leads to sub-optimal sample bounds.
Here we study the following broad question: For a given property $\mathcal{P}$, can we {\em characterize} the sample complexity of testing $\mathcal{P}$ as a function of all relevant problem parameters, including the error probability $\delta$? Prior to this work, uniformity testing was the only statistical task whose sample complexity had been characterized in this setting. As our main results, we provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors, as a function of all relevant parameters. We also show matching information-theoretic lower bounds on the sample complexity of these problems. Our techniques naturally extend to give optimal testers for related problems. To illustrate the generality of our methods, we give optimal algorithms for testing collections of distributions and testing closeness with unequal sized samples.

Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees [abstract] [arxiv]
S. Macke, M. Aliakbarpour, I. Diakonikolas, A. Parameswaran, R. Rubinfeld
Proceedings of the 37th IEEE International Conference on Data Engineering (ICDE 2021)
Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how to accelerate queries involving aggregates on a real dataset with speedups of up to 124x over traditional AQP-with-guarantees and more than 1000x over exact methods.

Outlier Robust Mean Estimation with Subgaussian Rates via Stability [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Pensia
Advances in Neural Information Processing Systems (NeurIPS 2020)
We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.

The Complexity of Adversarially Robust Proper Learning of Halfspaces with Agnostic Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, P. Manurangsi
Advances in Neural Information Processing Systems (NeurIPS 2020)
We study the computational complexity of adversarially robust proper learning of halfspaces in the distribution-independent agnostic PAC model, with a focus on $L_p$ perturbations. We give a computationally efficient learning algorithm and a nearly matching computational hardness result for this problem. An interesting implication of our findings is that the $L_{\infty}$ perturbations case is provably computationally harder than the case $2 \leq p < \infty$.

Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals [abstract] [arxiv]
I. Diakonikolas, D. Kane, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2020)
We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals. In the former problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\R^d \times \{ \pm 1\}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with 0-1 loss $\opt+\eps$, where $\opt$ is the 0-1 loss of the best-fitting halfspace. In the latter problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\R^d \times \R$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with square loss $\opt+\eps$, where $\opt$ is the square loss of the best-fitting ReLU. We prove Statistical Query (SQ) lower bounds of $d^{\poly(1/\eps)}$ for both of these problems. Our SQ lower bounds provide strong evidence that current upper bounds for these tasks are essentially best possible.

List-Decodable Mean Estimation via Iterative Multi-Filtering [abstract] [arxiv]
I. Diakonikolas, D. Kane, D. Kongsgaard
Advances in Neural Information Processing Systems (NeurIPS 2020)
We study the problem of {\em list-decodable mean estimation} for bounded covariance distributions. Specifically, we are given a set T of points in with the promise that an unknown $\alpha$-fraction of points in T, where $0<\alpha<1/2$, are drawn from an unknown mean and bounded covariance distribution D, and no assumptions are made on the remaining points. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the mean of D. We give the first practically viable estimator for this problem. In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error. While the only prior algorithm for this setting inherently relied on the ellipsoid method, our algorithm is iterative and only uses spectral techniques. Our main technical innovation is the design of a soft outlier removal procedure for high-dimensional heavy-tailed datasets with a majority of outliers.

Non-Convex SGD Learns Halfspaces with Adversarial Label Noise [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Advances in Neural Information Processing Systems (NeurIPS 2020)
We study the problem of agnostically learning homogeneous halfspaces in the distribution-specific PAC model. For a broad family of structured distributions, including log-concave distributions, we show that non-convex SGD efficiently converges to a solution with misclassification error $O(\opt)+\eps$, where $\opt$ is the misclassification error of the best-fitting halfspace. In sharp contrast, we show that optimizing any convex surrogate inherently leads to misclassification error of $\omega(\opt)$, even under Gaussian marginals.

Small Covers for Near-Zero Sets of Polynomials and Learning Latent Variable Models [abstract] [arxiv]
I. Diakonikolas, D. Kane
Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2020)
Let $V$ be any vector space of multivariate degree-$d$ homogeneous polynomials with co-dimension at most $k$, and $S$ be the set of points where all polynomials in $V$ {\em nearly} vanish. We establish a qualitatively optimal upper bound on the size of $\eps$-covers for $S$, in the $\ell_2$-norm. Roughly speaking, we show that there exists an $\eps$-cover for $S$ of cardinality $M = (k/\eps)^{O_d(k^{1/d})}$. Our result is constructive yielding an algorithm to compute such an $\eps$-cover that runs in time $\poly(M)$. Building on our structural result, we obtain significantly improved learning algorithms for several fundamental high-dimensional probabilistic models with hidden variables. These include density and parameter estimation for $k$-mixtures of spherical Gaussians (with known common covariance), PAC learning one-hidden-layer ReLU networks with $k$ hidden units (under the Gaussian distribution), density and parameter estimation for $k$-mixtures of linear regressions (with Gaussian covariates), and parameter estimation for $k$-mixtures of hyperplanes. Our algorithms run in time {\em quasi-polynomial} in the parameter $k$. Previous algorithms for these problems had running times exponential in $k^{\Omega(1)}$.

Robustly Learning any Clusterable Mixture of Gaussians [abstract] [arxiv]
I. Diakonikolas, S. Hopkins, D. Kane, S. Karmalkar
Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2020)
(Conference version to be merged with this paper)
We study the efficient learnability of high-dimensional Gaussian mixtures in the outlier-robust setting, where a small constant fraction of the data is adversarially corrupted. We resolve the polynomial learnability of this problem when the components are pairwise separated in total variation distance. Specifically, we provide an algorithm that, for any constant number of components $k$, runs in polynomial time and learns the components of an $\eps$-corrupted $k$-mixture within information theoretically near-optimal error of $\tilde{O}(\eps)$, under the assumption that the overlap between any pair of components $P_i, P_j$ (i.e., the quantity $1-TV(P_i, P_j)$) is bounded by $\mathrm{poly}(\eps)$. Our separation condition is the qualitatively weakest assumption under which accurate clustering of the samples is possible. In particular, it allows for components with arbitrary covariances and for components with identical means, as long as their covariances differ sufficiently. Ours is the first polynomial time algorithm for this problem, even for $k=2$. Our algorithm follows the Sum-of-Squares based \emph{proofs to algorithms} approach. Our main technical contribution is a new robust identifiability proof of clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system. The key ingredients of this proof are a novel use of \emph{SoS-certifiable anti-concentration} and a new characterization of pairs of Gaussians with small (dimension-independent) overlap in terms of their parameter distance.

High-Dimensional Robust Mean Estimation via Gradient Descent [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, R. Ge, M. Soltanolkotabi
Proceedings of the 37th International Conference on Machine Learning (ICML 2020)

Efficiently Learning Adversarially Robust Halfspaces with Noise [abstract] [arxiv]
O. Montasser, S. Goel, I. Diakonikolas, N. Srebro
Proceedings of the 37th International Conference on Machine Learning (ICML 2020)

Efficient Algorithms for Multidimensional Segmented Regression [abstract] [arxiv]
I. Diakonikolas, J. Li, A. Voloshinov
Manuscript, 2020

Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Kontonis, N. Zarifis
Proceedings of the 33rd Annual Conference on Learning Theory (COLT 2020)

Approximation Schemes for ReLU Regression [abstract] [arxiv]
I. Diakonikolas, S. Goel, S. Karmalkar, A. Klivans, M. Soltanolkotabi
Proceedings of the 33rd Annual Conference on Learning Theory (COLT 2020)

Learning Halfspaces with Massart Noise Under Structured Distributions [abstract] [arxiv]
I. Diakonikolas, V. Kontonis, C. Tzamos, N. Zarifis
Proceedings of the 33rd Annual Conference on Learning Theory (COLT 2020)

Private Testing of Distributions via Sample Permutations [abstract] [proceedings]
M. Aliakbarpour, I. Diakonikolas, D. Kane, R. Rubinfeld
Advances in Neural Information Processing Systems (NeurIPS 2019)

Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin [abstract] [arxiv]
I. Diakonikolas, D. Kane, P. Manurangsi
Advances in Neural Information Processing Systems (NeurIPS 2019)
Selected for Spotlight Presentation at NeurIPS 2019

Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering [abstract] [arxiv]
I. Diakonikolas, S. Karmalkar, D. Kane, E. Price, A. Stewart
Advances in Neural Information Processing Systems (NeurIPS 2019)

Distribution-Independent PAC Learning of Halfspaces with Massart Noise [abstract] [arxiv]
I. Diakonikolas, T. Gouleakis, C. Tzamos
Advances in Neural Information Processing Systems (NeurIPS 2019)
Outstanding Paper Award at NeurIPS 2019
See here for a description of our work by the program chairs and here for the slides of my NeurIPS talk.
We study the problem of {\em distribution-independent} PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples $(\mathbf{x}, y)$ drawn from a distribution $\mathcal{D}$ on $\R^{d+1}$ such that the marginal distribution on the unlabeled points $\mathbf{x}$ is arbitrary and the labels $y$ are generated by an unknown halfspace corrupted with Massart noise at noise rate $\eta<1/2$. The goal is to find a hypothesis $h$ that minimizes the misclassification error $Pr_{(\mathbf{x}, y) \sim \mathcal{D}} \left[ h(\mathbf{x}) \neq y \right]$.
We give a $\poly(d, 1/\eps)$ time algorithm for this problem with misclassification error $\eta+\eps$. We also provide evidence that improving on the error guarantee of our algorithm might be computationally hard. Prior to our work, no efficient weak (distribution-independent) learner was known in this model, even for the class of disjunctions. The existence of such an algorithm for halfspaces (or even disjunctions) has been posed as an open question in various works, starting with Sloan (1988), Cohen (1997), and was most recently highlighted in Avrim Blum's FOCS 2003 tutorial.

Equipping Experts/Bandits with Long-term Memory [abstract] [arxiv]
K. Zheng, H. Luo, I. Diakonikolas, L. Wang
Advances in Neural Information Processing Systems (NeurIPS 2019)
We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002,by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with $K$ actions and $T$ rounds, using our framework we develop various algorithms with a regret bound of order $O(\sqrt{T(S\ln T + n \ln K)})$ compared to any sequence of experts with $S-1$ switches among $n \leq \min\{S, K\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework, we also achieve the best of both stochastic and adversarial environments simultaneously. This resolves an open problem of Warmuth and Koolen, 2014. Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandits, a sharp contrast with the non-contextual case.

A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families [abstract] [arxiv]
B. Axelrod, I. Diakonikolas, A. Sidiropoulos, A. Stewart, G. Valiant
Advances in Neural Information Processing Systems (NeurIPS 2019)
We study the problem of computing the maximum likelihood estimator (MLE) of multivariate log-concave densities. Our main result is the first computationally efficient algorithm for this problem. In more detail, we give an algorithm that, on input a set of $n$ points in $\R^d$ and an accuracy parameter $\epsilon>0$, it runs in time $\poly(n, d, 1/\epsilon)$, and outputs a log-concave density that with high probability maximizes the log-likelihood up to an additive $\epsilon$. Our approach relies on a natural convex optimization formulation of the underlying problem that can be efficiently solved by a projected stochastic subgradient method. The main challenge lies in showing that a stochastic subgradient of our objective function can be efficiently approximated. To achieve this, we rely on structural results on approximation of log-concave densities and leverage classical algorithmic tools on volume approximation of convex bodies and uniform sampling from convex sets.

Faster Algorithms for High-Dimensional Robust Covariance Estimation [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, R. Ge, D. Woodruff
Proceedings of the 32nd Annual Conference on Learning Theory (COLT 2019)
We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal error guarantees for several natural structured distributions. Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance.
Given $N = \tilde{\Omega}(d^2/\epsilon^2)$ samples from a $d$-dimensional Gaussian distribution, an $\epsilon$-fraction of which may be arbitrarily corrupted, our algorithm runs in time $\tilde{O}(d^{3.26})/\poly(\epsilon)$ and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes $\tilde{\Omega}(d^{2 \omega})$ when $\epsilon = \Omega(1)$, where $\omega$ is the exponent of matrix multiplication. We also provide evidence that improving the running time of our algorithm may require new algorithmic techniques.

Communication and Memory Efficient Testing of Discrete Distributions [abstract] [arxiv]
I. Diakonikolas, T. Gouleakis, D. Kane, S. Rao
Proceedings of the 32nd Annual Conference on Learning Theory (COLT 2019)
We study distribution testing with communication and memory constraints in the following computational models: (1) The {\em one-pass streaming model} where the goal is to minimize the sample complexity of the protocol subject to a memory constraint, and (2) A {\em distributed model} where the data samples reside at multiple machines and the goal is to minimize the communication cost of the protocol. In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). Moreover, we show nearly-tight lower bounds on (1) the sample complexity of any one-pass streaming tester for uniformity, subject to the memory constraint, and (2) the communication cost of any uniformity testing protocol, in a restricted ``one-pass'' model of communication.

Testing Identity of Multidimensional Histograms [abstract] [arxiv]
I. Diakonikolas, D. Kane, J. Peebles
Proceedings of the 32nd Annual Conference on Learning Theory (COLT 2019)
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \to \R_+$, where $D \subseteq \R^d$, is called a $k$-histogram if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental non-parametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound.
More specifically, let $q$ be an unknown $d$-dimensional $k$-histogram and $p$ be an explicitly given $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \eps$. We design a computationally efficient algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\eps^2) \log^{O(d)}(k/\eps))$. Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised to be {\em close} to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a nearly-matching sample complexity lower bound of $\Omega((\sqrt{k}/\eps^2) (\log(k/\eps)/d)^{\Omega(d)})$ when $d\geq 2$.
Prior to our work, the sample complexity of the $d=1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.

Sever: A Robust Meta-Algorithm for Stochastic Optimization [abstract] [arxiv] [code]
I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, A. Stewart
Proceedings of the 36th International Conference on Machine Learning (ICML 2019)
Oral Presentation at the NeurIPS 2018 Workshop on Security in Machine Learning (SECML 2018)
In high dimensions, most machine learning methods are brittle to even a small fraction of structured outliers. To address this, we introduce a new meta-algorithm that can take in a \emph{base learner} such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. Our method, Sever, possesses strong theoretical guarantees yet is also highly scalable---beyond running the base learner itself, it only requires computing the top singular vector of a certain $n \times d$ matrix. We apply Sever on a drug design dataset and a spam classification dataset, and find that in both cases it has substantially greater robustness than several baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$ test error, compared to $13.4\%-20.5\%$ for the baselines, and $3\%$ error on the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$ corruptions, we achieved $1.42$ mean-squared test error, compared to $1.51$-$2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.

Degree-d Chow Parameters Robustly Determine Degree-d PTFs (and Algorithmic Applications) [abstract] [eccc]
I. Diakonikolas, D. Kane
Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC 2019)
The degree-$d$ Chow parameters of a Boolean function are its degree at most $d$ Fourier coefficients. It is well-known that degree-$d$ Chow parameters uniquely characterize degree-$d$ polynomial threshold functions (PTFs) within the space of all bounded functions. In this paper, we prove a robust version of this theorem: For $f$ any Boolean degree-$d$ PTF and $g$ any bounded function, if the degree-$d$ Chow parameters of $f$ are close to the degree-$d$ Chow parameters of $g$ in $\ell_2$-norm, then $f$ is close to $g$ in $\ell_1$-distance. Notably, our bound relating the two distances is independent of the dimension. That is, we show that Boolean degree-$d$ PTFs are {\em robustly identifiable} from their degree-$d$ Chow parameters. No non-trivial bound was previously known for $d >1$.
Our robust identifiability result gives the following algorithmic applications: First, we show that Boolean degree-$d$ PTFs can be efficiently approximately reconstructed from approximations to their degree-$d$ Chow parameters. This immediately implies that degree-$d$ PTFs are efficiently learnable in the uniform distribution $d$-RFA model. As a byproduct of our approach, we also obtain the first low integer-weight approximations of degree-$d$ PTFs, for $d>1$. As our second application, our robust identifiability result gives the first efficient algorithm, with dimension-independent error guarantees, for malicious learning of Boolean degree-$d$ PTFs under the uniform distribution.
The proof of our robust identifiability result involves several new technical ingredients, including the following structural result for degree-$d$ multivariate polynomials with very poor anti-concentration: If $p$ is a degree-$d$ polynomial where $p(x)$ is {\em very} close to $0$ on a {\em large} number of points in $\{ \pm 1\}^n$, then there exists a degree-$d$ hypersurface that exactly passes though {\em almost all} of these points. We leverage this structural result to show that if the degree-$d$ Chow distance between $f$ and $g$ is small, then we can find many degree-$d$ polynomials that vanish on their disagreement region, and in particular enough that forces the $\ell_1$-distance between $f$ and $g$ to also be small. To implement this proof strategy, we require additional technical ideas. In particular, in the $d=2$ case we show that for any large vector space of degree-$2$ polynomials with a large number of common zeroes, there exists a linear function that vanishes on almost all of these zeroes. The degree-$d$ degree generalization of this statement is significantly more complex, and can be viewed as an effective version of Hilbert's Basis Theorem for our setting.

Efficient Robust Proper Learning of Log-concave Distributions [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Manuscript
We study the {\it robust proper learning} of univariate log-concave distributions (over continuous and discrete domains). Given a set of samples drawn from an unknown target distribution, we want to compute a log-concave hypothesis distribution that is as close as possible to the target, in total variation distance. In this work, we give the first computationally efficient algorithm for this learning problem. Our algorithm achieves the information-theoretically optimal sample size (up to a constant factor), runs in polynomial time, and is robust to model misspecification with nearly-optimal error guarantees.
Specifically, we give an algorithm that, on input $n=O(1/\epsilon^{5/2})$ samples from an unknown distribution $f$, runs in time $\widetilde{O}(n^{8/5})$, and outputs a log-concave hypothesis $h$ that (with high probability) satisfies $d_{\mathrm{TV}}(h, f) = O(\mathrm{opt})+\epsilon$, where $\mathrm{opt}$ is the minimum total variation distance between $f$ and the class of log-concave distributions. Our approach to the robust proper learning problem is quite flexible and may be applicable to many other univariate distribution families.

On the Complexity of the Inverse Semivalue Problem for Weighted Voting Games [abstract] [arxiv]
I. Diakonikolas, C. Pavlou
Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019)
Weighted voting games are a family of cooperative games, typically used to model voting situations where a number of agents (players) vote against or for a proposal. In such games, a proposal is accepted if an appropriately weighted sum of the votes exceeds a prespecified threshold. As the influence of a player over the voting outcome is not in general proportional to her assigned weight, various power indices have been proposed to measure each player's influence. The inverse power index problem is the problem of designing a weighted voting game that achieves a set of target influences according to a predefined power index. In this work, we study the computational complexity of the inverse problem when the power index belongs to the class of semivalues. We prove that the inverse problem is computationally intractable for a broad family of semivalues, including all regular semivalues. As a special case of our general result, we establish computational hardness of the inverse problem for the Banzhaf indices and the Shapley values, arguably the most popular power indices.

Collision-based Testers are Optimal for Uniformity and Closeness [abstract] [eccc]
I. Diakonikolas, T. Gouleakis, J. Peebles, E. Price
Chicago Journal of Theoretical Computer Science, 2019
Also see Oded Goldreich's exposition of our proof here
We study the fundamental problems of (i) uniformity testing of a discrete distribution, and (ii) closeness testing between two discrete distributions with bounded $\ell_2$-norm. These problems have been extensively studied in distribution testing and sample-optimal estimators are known for them~\cite{Paninski:08, CDVV14, VV14, DKN:15}.
In this work, we show that the original collision-based testers proposed for these problems ~\cite{GRdist:00, BFR+:00} are sample-optimal, up to constant factors. Previous analyses showed sample complexity upper bounds for these testers that are optimal as a function of the domain size $n$, but suboptimal by polynomial factors in the error parameter $\epsilon$. Our main contribution is a new tight analysis establishing that these collision-based testers are information-theoretically optimal, up to constant factors, both in the dependence on $n$ and in the dependence on $\epsilon$.

High-Dimensional Robust Mean Estimation in Nearly-Linear Time [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, R. Ge
Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2019)
We study the fundamental problem of high-dimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimension-independent error guarantees for several families of structured distributions.
In this work, we give the first nearly-linear time algorithms for high-dimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and sub-gaussian tails, and (ii) unknown bounded covariance. Given $N$ samples on $\mathbb{R}^d$, an $\epsilon$-fraction of which may be arbitrarily corrupted, our algorithms run in time $\tilde{O}(Nd) / \mathrm{poly}(\epsilon)$ and approximate the true mean within the information-theoretically optimal error, up to constant factors. Previous robust algorithms with comparable error guarantees have running times $\tilde{\Omega}(N d^2)$, for $\epsilon = \Omega(1)$.
Our algorithms rely on a natural family of SDPs parameterized by our current guess $\nu$ for the unknown mean $\mu^\star$. We give a win-win analysis establishing the following: either a near-optimal solution to the primal SDP yields a good candidate for $\mu^\star$ -- independent of our current guess $\nu$ -- or the dual SDP yields a new guess $\nu'$ whose distance from $\mu^\star$ is smaller by a constant factor. We exploit the special structure of the corresponding SDPs to show that they are approximately solvable in nearly-linear time. Our approach is quite general, and we believe it can also be applied to obtain nearly-linear time algorithms for other high-dimensional robust learning problems.

Efficient Algorithms and Lower Bounds for Robust Linear Regression [abstract] [arxiv]
I. Diakonikolas, W. Kong, A. Stewart
Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2019)
We study the prototypical problem of high-dimensional linear regression in a robust model where an $\epsilon$-fraction of the samples can be adversarially corrupted. We focus on the fundamental setting where the covariates of the uncorrupted samples are drawn from a Gaussian distribution $\mathcal{N}(0, \Sigma)$ on $\R^d$. We give nearly tight upper bounds and computational lower bounds for this problem. Specifically, our main contributions are as follows:
For the case that the covariance matrix is known to be the identity, we give a sample near-optimal and computationally efficient algorithm that draws $\tilde{O}(d/\eps^2)$ labeled examples and outputs a candidate hypothesis vector $\widehat{\beta}$ that approximates the unknown regression vector $\beta$ within $\ell_2$-norm $O(\eps \log(1/\eps) \sigma)$, where $\sigma$ is the standard deviation of the random observation noise. An error of $\Omega (\eps \sigma)$ is information-theoretically necessary, even with infinite sample size. Hence, the error guarantee of our algorithm is optimal, up to a logarithmic factor in $1/\eps$. Prior work gave an algorithm for this problem with sample complexity $\tilde{\Omega}(d^2/\eps^2)$ whose error guarantee scales with the $\ell_2$-norm of $\beta$.
For the case of unknown covariance $\Sigma$, we show that we can efficiently achieve the same error guarantee of $O(\eps \log(1/\eps) \sigma)$, as in the known covariance case, using an additional $\tilde{O}(d^2/\eps^2)$ unlabeled examples. On the other hand, an error of $O(\eps \sigma)$ can be information-theoretically attained with $O(d/\eps^2)$ samples. We prove a Statistical Query (SQ) lower bound providing evidence that this quadratic tradeoff in the sample size is inherent. More specifically, we show that any polynomial time SQ learning algorithm for robust linear regression (in Huber's contamination model) with estimation complexity $O(d^{2-c})$, where $c>0$ is an arbitrarily small constant, must incur an error of $\Omega(\sqrt{\eps} \sigma)$.

Fourier-based Testing for Families of Distributions [abstract] [eccc]
C. Canonne, I. Diakonikolas, A. Stewart
Advances in Neural Information Processing Systems (NeurIPS 2018)
We study the general problem of testing whether an unknown discrete distribution belongs to a given family of distributions. More specifically, given a class of distributions $\mathcal{P}$ and sample access to an unknown distribution $p$, we want to distinguish (with high probability) between the case that $p \in \mathcal{P}$ and the case that $p$ is $\epsilon$-far, in total variation distance, from every distribution in $\mathcal{P}$. This is the prototypical hypothesis testing problem that has received significant attention in statistics and, more recently, in theoretical computer science.
The sample complexity of this general problem depends on the underlying family $\mathcal{P}$. We are interested in designing sample-optimal and computationally efficient algorithms for this task. The main contribution of this work is a new and simple testing technique that is applicable to distribution families whose \emph{Fourier spectrum} approximately satisfies a certain \emph{sparsity} property. As the main applications of our Fourier-based testing technique, we obtain the first non-trivial testers for two fundamental families of discrete distributions: Sums of Independent Integer Random Variables (SIIRVs) and Poisson Multinomial Distributions (PMDs). Our testers for these families are nearly sample-optimal and computationally efficient. We also obtain a tester with improved sample complexity for discrete log-concave distributions. To the best of our knowledge, ours is the first use of the Fourier transform in the context of distribution testing.

Sharp Bounds for Generalized Uniformity Testing [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Advances in Neural Information Processing Systems (NeurIPS 2018)
Selected for Spotlight Presentation at NeurIPS 2018
See the comments in Oded Goldreich's Choices #229
We study the problem of {\em generalized uniformity testing}~\cite{BC17} of a discrete probability distribution: Given samples from a probability distribution $p$ over an {\em unknown} discrete domain $\mathbf{\Omega}$, we want to distinguish, with probability at least $2/3$, between the case that $p$ is uniform on some {\em subset} of $\mathbf{\Omega}$ versus $\epsilon$-far, in total variation distance, from any such uniform distribution. We establish tight bounds on the sample complexity of generalized uniformity testing. In more detail, we present a computationally efficient tester whose sample complexity is optimal, up to constant factors, and a matching information-theoretic lower bound. Specifically, we show that the sample complexity of generalized uniformity testing is $\Theta\left(1/(\epsilon^{4/3}\|p\|_3) + 1/(\epsilon^{2} \|p\|_2) \right)$.

Robust Learning of Fixed-Structure Bayesian Networks [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, D. Kane, A. Stewart
Advances in Neural Information Processing Systems (NeurIPS 2018)
We investigate the problem of learning Bayesian networks in an agnostic model where an $\epsilon$-fraction of the samples are adversarially corrupted. Our agnostic learning model is similar to -- in fact, stronger than -- Huber's contamination model in robust statistics. In this work, we study the fully observable Bernoulli case where the structure of the network is given. Even in this basic setting, previous learning algorithms either run in exponential time or lose dimension-dependent factors in their error guarantees. We provide the first computationally efficient agnostic learning algorithm for this problem with dimension-independent error guarantees. Our algorithm has polynomial sample complexity, runs in polynomial time, and achieves error that scales nearly-linearly with the fraction of adversarially corrupted samples.

Differentially Private Identity and Closeness Testing of Discrete Distributions [abstract] [arxiv]
M. Aliakbarpour, I. Diakonikolas, R. Rubinfeld
Proceedings of the 35th International Conference on Machine Learning (ICML 2018)
We investigate the problems of identity and closeness testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing Differential Privacy to the individuals of the population. We describe an approach that yields sample-efficient differentially private testers for these problems. Our theoretical results show that there exist private identity and closeness testers that are nearly as sample-efficient as their non-private counterparts. We perform an experimental evaluation of our algorithms on synthetic data. Our experiments illustrate that our private testers achieve small type I and type II errors with sample size {\em sublinear} in the domain size of the underlying distributions.

Near-Optimal Sample Complexity Bounds for Maximum Likelihood Estimation of Multivariate Log-concave Densities [abstract] [arxiv]
T. Carpenter, I. Diakonikolas, A. Sidiropoulos, A. Stewart
Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018)

Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms [abstract] [arxiv]
I. Diakonikolas, J. Li, L. Schmidt
Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018)

Sample-Optimal Identity Testing with High Probability [abstract] [arxiv]
I. Diakonikolas, T. Gouleakis, J. Peebles, E. Price
Proceedings of the 45th Intl. Colloquium on Automata, Languages and Programming (ICALP 2018)
See the comments in Oded Goldreich's Choices #229
We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0< \epsilon, \delta < 1$, we wish to distinguish, {\em with probability at least $1-\delta$}, whether the distributions are identical versus $\epsilon$-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$.
Typical applications of distribution property testing require small values of the confidence parameter $\delta$ (which correspond to small ``$p$-values'' in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that this upper bound is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is \[ \Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right) \] for any $n, \epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $\mathrm{d}_{TV}(p, U_n) \geq \epsilon$, we simply threshold $\mathrm{d}_{TV}(\hat{p}, U_n)$, where $\hat{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.

Testing Conditional Independence of Discrete Distributions [abstract] [arxiv]
C. Canonne, I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC 2018)

List-Decodable Robust Mean Estimation and Learning Mixtures of Spherical Gaussians [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC 2018)

Learning Geometric Concepts with Nasty Noise [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC 2018)

Robustly Learning a Gaussian: Getting Optimal Error, Efficiently [abstract] [arxiv]
I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart
Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018)
We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise - where an $\epsilon$-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error $O(\epsilon)$ in the total variation distance, which is optimal up to a universal constant that is independent of the dimension.
In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$ and $1/\epsilon$. When both the mean and covariance are unknown, the running time is polynomial in $d$ and quasipolynomial in $1/\epsilon$. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings.

Communication-Efficient Distributed Learning of Discrete Distributions [abstract] [proceedings]
I. Diakonikolas, E. Grigorescu, J. Li, A. Natarajan, K. Onak, L. Schmidt
Advances in Neural Information Processing Systems (NIPS 2017)
Selected for Oral Presentation at NIPS 2017
We initiate a systematic investigation of density estimation when the data is {\em distributed} across multiple servers. The servers must communicate with a referee and the goal is to estimate the underlying distribution with as few bits of communication as possible. We focus on non-parametric density estimation of discrete distributions with respect to the $\ell_1$ and $\ell_2$ norms. We provide the first non-trivial upper and lower bounds on the communication complexity of this basic estimation task in various settings of interest.
When the unknown discrete distribution is {\em unstructured} and each server has only one sample, we show that any {\em blackboard} protocol (i.e., any protocol in which servers interact arbitrarily using public messages) that learns the distribution must essentially communicate the entire sample. For the case of {\em structured} distributions, such as $k$-histograms and monotone distributions, we design distributed learning algorithms that achieve significantly better communication guarantees than the naive ones, and obtain tight upper and lower bounds in several regimes. Our distributed learning algorithms run in near-linear time and are robust to model misspecification.
Our results provide insights on the interplay between structure and communication efficiency for a range of fundamental distribution estimation tasks.

Statistical Query Lower Bounds for Robust Estimation of High-dimensional Gaussians and Gaussian Mixtures [abstract] [eccc] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2017)
See here and here for videos of related talks.
We prove the first {\em Statistical Query lower bounds} for two fundamental high-dimensional learning problems involving Gaussian distributions: (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown mean Gaussian. In particular, we show a {\em super-polynomial gap} between the (information-theoretic) sample complexity and the complexity of {\em any} Statistical Query algorithm for these problems. Statistical Query (SQ) algorithms are a class of algorithms that are only allowed to query expectations of functions of the distribution rather than directly access samples. This class of algorithms is quite broad: with the sole exception of Gaussian elimination over finite fields, all known algorithmic approaches in machine learning can be implemented in this model.
Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs (all of which can be implemented as SQ algorithms). At a conceptual level, this result implies that -- as far as SQ algorithms are concerned -- the computational complexity of learning GMMs is inherently exponential {\it in the dimension of the latent space} -- even though there is no such information-theoretic barrier. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in~\cite{DiakonikolasKKLMS16} is essentially best possible among all polynomial-time SQ algorithms. On the positive side, we give a new SQ learning algorithm for this problem with optimal accuracy whose running time nearly matches our lower bound. Both our SQ lower bounds are attained via a unified moment-matching technique that may be useful in other contexts. Our SQ learning algorithm for Problem (2) relies on a filtering technique that removes outliers based on higher-order tensors.
Our lower bound technique also has implications for related inference problems, specifically for the problem of robust {\it testing} of an unknown mean Gaussian. Here we show an information-theoretic lower bound which separates the sample complexity of the robust testing problem from its non-robust variant. This result is surprising because such a separation does not exist for the corresponding learning problem.

Being Robust (in High Dimensions) Can be Practical [abstract] [arxiv] [code]
I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart
Proceedings of the 34th International Conference on Machine Learning (ICML 2017)
Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.

Testing Bayesian Networks [abstract] [arxiv]
C. Canonne, I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 30th Annual Conference on Learning Theory (COLT 2017)
Journal version in IEEE Transactions on Information Theory, to appear.
This work initiates a systematic investigation of testing {\em high-dimensional} structured distributions by focusing on testing {\em Bayesian networks} -- the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other non-descendant nodes once its parents are fixed. Specifically, we study the properties of identity testing and closeness testing of Bayesian networks. Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. For a wide range of parameter settings, our testing algorithms have sample complexity {\em sublinear} in the dimension and are sample-optimal, up to constant factors.

Learning Multivariate Log-concave Distributions [abstract] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 30th Annual Conference on Learning Theory (COLT 2017)

Near-optimal Closeness Testing of Discrete Histogram Distributions [abstract] [arxiv]
I. Diakonikolas, D. Kane, V. Nikishkin
Proceedings of the 44th Intl. Colloquium on Automata, Languages and Programming (ICALP 2017)
We investigate the problem of testing the equivalence between two discrete histograms. A {\em $k$-histogram} over $[n]$ is a probability distribution that is piecewise constant over some set of $k$ intervals over $[n]$. Histograms have been extensively studied in computer science and statistics. Given a set of samples from two $k$-histogram distributions $p, q$ over $[n]$, we want to distinguish (with high probability) between the cases that $p = q$ and $\|p-q\|_1 \geq \epsilon$. The main contribution of this paper is a new algorithm for this testing problem and a nearly matching information-theoretic lower bound. Specifically, the sample complexity of our algorithm matches our lower bound up to a logarithmic factor, improving on previous work by polynomial factors in the relevant parameters. Our algorithmic approach applies in a more general framework and yields improved sample upper bounds for testing closeness of other structured distributions as well.

Playing Anonymous Games using Simple Strategies [abstract] [arxiv]
Y. Cheng, I. Diakonikolas, A. Stewart
Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2017)
We investigate the complexity of computing approximate Nash equilibria in anonymous games. Our main algorithmic result is the following: For any $n$-player anonymous game with a bounded number of strategies and any constant $\delta>0$, an $O(1/n^{1-\delta})$-approximate Nash equilibrium can be computed in polynomial time. Complementing this positive result, we show that if there exists any constant $\delta>0$ such that an $O(1/n^{1+\delta})$-approximate equilibrium can be computed in polynomial time, then there is a fully polynomial-time approximation scheme for this problem.
We also present a faster algorithm that, for any $n$-player $k$-strategy anonymous game, runs in time $\tilde O((n+k) k n^k)$ and computes an $\tilde O(n^{-1/3} k^{11/3})$-approximate equilibrium. This algorithm follows from the existence of simple approximate equilibria of anonymous games, where each player plays one strategy with probability $1-\delta$, for some small $\delta$, and plays uniformly at random with probability $\delta$.
Our approach exploits the connection between Nash equilibria in anonymous games and Poisson multinomial distributions (PMDs). Specifically, we prove a new probabilistic lemma establishing the following: Two PMDs, with large variance in each direction, whose first few moments are approximately matching are close in total variation distance. Our structural result strengthens previous work by providing a smooth tradeoff between the variance bound and the number of matching moments.

Sample Optimal Density Estimation in Nearly-Linear Time [abstract] [pdf] [code]
J. Acharya, I. Diakonikolas, J. Li, L. Schmidt
Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2017)

Robust Estimators in High Dimensions without the Computational Intractability [abstract] [arxiv]
I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart
Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
SIAM Journal on Computing Special Issue for FOCS 2016
Invited to Communications of the ACM, Research Highlights
See here for my Simons Institute tutorial on the topic and here for a previous related talk.

A New Approach for Testing Properties of Discrete Distributions [abstract] [pdf] [arxiv]
I. Diakonikolas, D. Kane
Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
Also see the comments in Oded Goldreich's Choices #188 and #195, and Oded's very nice exposition of our framework here

Fast Algorithms for Segmented Regression [abstract] [pdf] [code]
J. Acharya, I. Diakonikolas, J. Li, L. Schmidt
Proceedings of the 33rd International Conference on Machine Learning (ICML 2016)

Properly Learning Poisson Binomial Distributions in Almost Polynomial Time [abstract] [pdf] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016)

Optimal Learning via the Fourier Transform for Sums of Independent Integer Random Variables [abstract] [pdf] [arxiv] [notes]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016)
We study the structure and learnability of sums of independent integer random variables (SIIRVs). For $k \in \mathbb{Z}_{+}$, a {\em $k$-SIIRV of order $n \in \mathbb{Z}_{+}$} is the probability distribution of the sum of $n$ mutually independent random variables each supported on $\{0, 1, \dots, k-1\}$. We denote by ${\cal S}_{n,k}$ the set of all $k$-SIIRVs of order $n$.
How many samples are required to learn an arbitrary distribution in ${\cal S}_{n,k}$? In this paper, we tightly characterize the sample and computational complexity of this problem. More precisely, we design a computationally efficient algorithm that uses $\widetilde{O}(k/\epsilon^2)$ samples, and learns an arbitrary $k$-SIIRV within error $\epsilon,$ in total variation distance. Moreover, we show that the {\em optimal} sample complexity of this learning problem is $\Theta((k/\epsilon^2)\sqrt{\log(1/\epsilon)}),$ i.e., we prove an upper bound and a matching information-theoretic lower bound. Our algorithm proceeds by learning the Fourier transform of the target $k$-SIIRV in its effective support. Its correctness relies on the {\em approximate sparsity} of the Fourier transform of $k$-SIIRVs -- a structural property that we establish, roughly stating that the Fourier transform of $k$-SIIRVs has small magnitude outside a small set.
Along the way we prove several new structural results about $k$-SIIRVs. As one of our main structural contributions, we give an efficient algorithm to construct a sparse {\em proper} $\epsilon$-cover for ${\cal S}_{n,k},$ in total variation distance. We also obtain a novel geometric characterization of the space of $k$-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of $\epsilon$-covers for ${\cal S}_{n,k}$ -- establishing that our cover upper bound is optimal -- and is the key ingredient in our tight sample complexity lower bound.
Our approach of exploiting the sparsity of the Fourier transform in distribution learning is general, and has recently found additional applications. In a subsequent work~\cite{DKS15c}, we use a generalization of this idea (in higher dimensions) to obtain the first efficient learning algorithm for Poisson multinomial distributions. In~\cite{DKS15b}, we build on this approach to obtain the fastest known proper learning algorithm for Poisson binomial distributions ($2$-SIIRVs).

The Fourier Transform of Poisson Multinomial Distributions and its Algorithmic Applications [abstract] [pdf] [arxiv]
I. Diakonikolas, D. Kane, A. Stewart
Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC 2016)
We study Poisson Multinomial Distributions -- a fundamental family of discrete distributions that generalize the binomial and multinomial distributions, and are commonly encountered in computer science. Formally, an $(n, k)$-Poisson Multinomial Distribution (PMD) is a random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent random vectors supported on the set $\{e_1, e_2, \ldots, e_k \}$ of standard basis vectors in $\mathbb{R}^k$. In this paper, we obtain a refined structural understanding of PMDs by analyzing their Fourier transform. As our core structural result, we prove that the Fourier transform of PMDs is \emph{approximately sparse}, i.e., roughly speaking, its $L_1$-norm is small outside a small set. By building on this result, we obtain the following applications:
Learning Theory. We design the first computationally efficient learning algorithm for PMDs with respect to the total variation distance. Our algorithm learns an arbitrary $(n, k)$-PMD within variation distance $\epsilon$ using a near-optimal sample size of $\widetilde{O}_k(1/\epsilon^2),$ and runs in time $\widetilde{O}_k(1/\epsilon^2) \cdot \log n.$ Previously, no algorithm with a $\mathrm{poly}(1/\epsilon)$ runtime was known, even for $k=3.$
Game Theory. We give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games. For normalized anonymous games with $n$ players and $k$ strategies, our algorithm computes a well-supported $\epsilon$-Nash equilibrium in time $n^{O(k^3)} \cdot (k/\epsilon)^{O(k^3\log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}.$ The best previous algorithm for this problem~\cite{DaskalakisP08, DaskalakisP2014} had running time $n^{(f(k)/\epsilon)^k},$ where $f(k) = \Omega(k^{k^2})$, for any $k>2.$
Statistics. We prove a multivariate central limit theorem (CLT) that relates an arbitrary PMD to a discretized multivariate Gaussian with the same mean and covariance, in total variation distance. Our new CLT strengthens the CLT of Valiant and Valiant~\cite{VV10b, ValiantValiant:11} by completely removing the dependence on $n$ in the error bound.
Along the way we prove several new structural results of independent interest about PMDs. These include: (i) a robust moment-matching lemma, roughly stating that two PMDs that approximately agree on their low-degree parameter moments are close in variation distance; (ii) near-optimal size proper $\epsilon$-covers for PMDs in total variation distance (constructive upper bound and nearly-matching lower bound). In addition to Fourier analysis, we employ a number of analytic tools, including the saddlepoint method from complex analysis, that may find other applications.

Testing Shape Restrictions of Discrete Distributions [abstract] [pdf]
C. Canonne, I. Diakonikolas, T. Gouleakis, R. Rubinfeld
Proceedings of the 33rd International Symposium on Theoretical Aspects of Computer Science (STACS 2016)
Invited to Special Issue for STACS 2016
We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D\in \mathcal{P}$ and $D$ is $\epsilon$-far in $\ell_1$ distance from $\mathcal{P}$.
We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, $t$-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.

Differentially Private Learning of Structured Discrete Distributions [abstract] [pdf] [code]
I. Diakonikolas, M. Hardt, L. Schmidt
Advances in Neural Information Processing Systems (NIPS 2015)

Optimal Algorithms and Lower Bounds for Testing Closeness of Structured Distributions [abstract] [pdf]
I. Diakonikolas, D. Kane, V. Nikishkin
Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)

On the Complexity of Optimal Lottery Pricing and Randomized Mechanisms [abstract] [pdf]
X. Chen, I. Diakonikolas, A. Orfanou, D. Paparas, X. Sun, M. Yannakakis
Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)
We study the optimal lottery problem and the optimal mechanism design problem in the setting of a single unit-demand buyer with item values drawn from independent distributions. Optimal solutions to both problems are characterized by a linear program with exponentially many variables.
For the menu size complexity of the optimal lottery problem, we present an explicit, simple instance with distributions of support size $2$, and show that exponentially many lotteries are required to achieve the optimal revenue. We also show that, when distributions have support size $2$ and share the same high value, the simpler scheme of item pricing can achieve the same revenue as the optimal menu of lotteries. The same holds for the case of two items with support size $2$ (but not necessarily the same high value).
For the computational complexity of the optimal mechanism design problem, we show that unless the polynomial-time hierarchy collapses (more precisely, $\mathrm{P}^{\mathrm{NP}}=\mathrm{P}^{\mathrm{\#P}}$), there is no universal efficient randomized algorithm to implement an optimal mechanism even when distributions have support size $3$.

Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms [abstract] [pdf]
J. Acharya, I. Diakonikolas, C. Hegde, J. Li, L. Schmidt
Proceedings of the 34th Annual ACM Symposium on Principles of Database Systems (PODS 2015)
Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one dimensional data distributions by histograms.
A $k$-histogram is a piecewise constant function with $k$ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: Given samples from a distribution $p$ over $\{1, \ldots, n \}$, compute a $k$-histogram that minimizes the $\ell_2$-distance from $p$, up to an additive $\epsilon$. We design an algorithm for this problem that uses the information--theoretically minimal sample size of $m = O(1/\epsilon^2)$, runs in sample--linear time $O(m)$, and outputs an $O(k)$-- histogram whose $\ell_2$-distance from $p$ is at most $O(\mathrm{opt}_k) +\epsilon$, where $\mathrm{opt}_k$ is the minimum $\ell_2$-distance between $p$ and any $k$-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size $n$.
We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.

Testing Identity of Structured Distributions [abstract] [pdf]
I. Diakonikolas, D. Kane, V. Nikishkin
Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2015)
We study the question of identity testing for structured distributions. More precisely, given samples from a {\em structured} distribution $q$ over $[n]$ and an explicit distribution $p$ over $[n]$, we wish to distinguish whether $q=p$ versus $q$ is at least $\epsilon$-far from $p$, in $L_1$ distance. In this work, we present a unified approach that yields new, simple testers, with sample complexity that is information-theoretically optimal, for broad classes of structured distributions, including $t$-flat distributions, $t$-modal distributions, log-concave distributions, monotone hazard rate (MHR) distributions, and mixtures thereof.

Learning from Satisfying Assignments [abstract] [pdf]
A. De, I. Diakonikolas, R. Servedio
Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2015)
This paper studies the problem of learning ``low-complexity" probability distributions over the Boolean hypercube $\{-1,1\}^n$. As in the standard PAC learning model, a learning problem in our framework is defined by a class ${\cal C}$ of Boolean functions over $\{-1,1\}^n$, but in our model the learning algorithm is given uniform random satisfying assignments of an unknown $f \in \cal{C}$ and its goal is to output a high-accuracy approximation of the uniform distribution over $f^{-1}(1).$ This distribution learning problem may be viewed as a demanding variant of standard Boolean function learning, where the learning algorithm only receives positive examples and -- more importantly -- must output a hypothesis function which has small \emph{multiplicative} error (i.e., small error relative to the size of $f^{-1}(1)$).
As our main results, we show that the two most widely studied classes of Boolean functions in computational learning theory -- linear threshold functions and DNF formulas -- have efficient distribution learning algorithms in our model. Our algorithm for linear threshold functions runs in time poly$(n,1/\epsilon)$ and our algorithm for polynomial-size DNF runs in time quasipoly$(n,1/\epsilon)$. We obtain both these results via a general approach that combines a broad range of technical ingredients, including the complexity-theoretic study of approximate counting and uniform generation; the Statistical Query model from learning theory; and hypothesis testing techniques from statistics. A key conceptual and technical ingredient of this approach is a new kind of algorithm which we devise called a ``densifier'' and which we believe may be useful in other contexts.
We also establish limitations on efficient learnability in our model by showing that the existence of certain types of cryptographic signature schemes imply that certain learning problems in our framework are computationally hard. Via this connection we show that assuming the existence of sufficiently strong unique signature schemes, there are no sub-exponential time learning algorithms in our framework for intersections of two halfspaces, for degree-2 polynomial threshold functions, or for monotone 2-CNF formulas. Thus our positive results for distribution learning come close to the limits of what can be achieved by efficient algorithms.

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms [abstract] [pdf]
S. Chan, I. Diakonikolas, R. Servedio, X. Sun
Advances in Neural Information Processing Systems (NIPS 2014)
Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of density estimation, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function.
In more detail, for any $k$ and $\epsilon$, we give an algorithm that makes $\tilde{O}(k/\epsilon^2)$ draws from $p$, runs in $\tilde{O}(k/\epsilon^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/\epsilon))$ pieces. With high probability the hypothesis $h$ satisfies $d_{\mathrm{TV}}(p,h) \leq C \cdot \mathrm{opt}_k(p) + \epsilon$, where $d_{\mathrm{TV}}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The ``approximation factor'' $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $\epsilon$ can achieve $C<2$ regardless of what kind of hypothesis distribution it uses.

Efficient Density Estimation via Piecewise Polynomial Approximation [abstract] [pdf]
S. Chan, I. Diakonikolas, R. Servedio, X. Sun
Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC 2014)
We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $\tau$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t(d+1)/\epsilon^2)$ samples from $p$, runs in time $\mathrm{poly}(t,d,1/\epsilon)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(\tau)+\epsilon)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $\tau=0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $\epsilon$ must use $\Omega({\frac {t(d+1)} {\mathrm{poly}(1 + \log(d+1))}} \cdot {\frac 1 {\epsilon^2}})$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming.

We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains. These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.

Deterministic Approximate Counting for Juntas of Degree-$2$ Polynomial Threshold Functions [abstract] [pdf]
A. De, I. Diakonikolas, R. Servedio
Proceedings of the 29th Annual IEEE Conference on Computational Complexity (CCC 2014)
Let $g: \{-1,1\}^k \rightarrow \{-1,1\}$ be any Boolean function and $q_1,\dots,q_k$ be any degree-$2$ polynomials over $\{-1,1\}^n.$ We give a deterministic algorithm which, given as input explicit descriptions of $g,q_1,\dots,q_k$ and an accuracy parameter $\epsilon>0$, approximates \[ \mathbf{Pr}_{x \sim \{-1,1\}^n}[g(\mathrm{sign}(q_1(x)),\dots,\mathrm{sign}(q_k(x)))=1] \] to within an additive $\pm \epsilon$. For any constant $\epsilon > 0$ and $k \geq 1$ the running time of our algorithm is a fixed polynomial in $n$ (in fact this is true even for some not-too-small $\epsilon = o_n(1)$ and not-too-large $k = \omega_n(1)$). This is the first fixed polynomial-time algorithm that can deterministically approximately count satisfying assignments of a natural class of depth-$3$ Boolean circuits.
Our algorithm extends a recent result \cite{DDS13:deg2count} which gave a deterministic approximate counting algorithm for a single degree-$2$ polynomial threshold function $\mathrm{sign}(q(x)),$ corresponding to the $k=1$ case of our result. Note that even in the $k=1$ case it is NP-hard to determine whether $\mathbf{Pr}_{x \sim \{-1,1\}^n}[\mathrm{sign}(q(x))=1]$ is nonzero, so any sort of multiplicative approximation is almost certainly impossible even for efficient randomized algorithms.
Our algorithm and analysis requires several novel technical ingredients that go significantly beyond the tools required to handle the $k=1$ case in \cite{DDS13:deg2count}. One of these is a new multidimensional central limit theorem for degree-$2$ polynomials in Gaussian random variables which builds on recent Malliavin-calculus-based results from probability theory. We use this CLT as the basis of a new decomposition technique for $k$-tuples of degree-$2$ Gaussian polynomials and thus obtain an efficient deterministic approximate counting algorithm for the Gaussian distribution, i.e., an algorithm for estimating \[ \mathbf{Pr}_{x \sim \mathcal{N}(0,1)^n}[g(\mathrm{sign}(q_1(x)),\dots,\mathrm{sign}(q_k(x)))=1]. \] Finally, a third new ingredient is a ``regularity lemma'' for $k$-tuples of degree-$d$ polynomial threshold functions. This generalizes both the regularity lemmas of \cite{DSTW:10,HKM:09} (which apply to a single degree-$d$ polynomial threshold function) and the regularity lemma of Gopalan et al \cite{GOWZ10} (which applies to a $k$-tuples of linear threshold functions, i.e., the case $d=1$). Our new regularity lemma lets us extend our deterministic approximate counting results from the Gaussian to the Boolean domain.

The Complexity of Optimal Multidimensional Pricing [abstract] [pdf]
X. Chen, I. Diakonikolas, D. Paparas, X. Sun, M. Yannakakis
Games and Economic Behavior, accepted with minor revisions
Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2014)
We resolve the complexity of revenue-optimal deterministic auctions in the unit-demand single-buyer Bayesian setting, i.e., the optimal item pricing problem, when the buyer's values for the items are independent. We show that the problem of computing a revenue-optimal pricing can be solved in polynomial time for distributions of support size $2$ and its decision version is NP-complete for distributions of support size $3$. We also show that the problem remains NP-complete for the case of identical distributions.

Optimal Algorithms for Testing Closeness of Discrete Distributions [abstract] [pdf]
S. Chan, I. Diakonikolas, G. Valiant, P. Valiant
Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2014)
Blog post about this work: Property Testing Review
We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions $p$ and $q$ over an $n$-element set, we wish to distinguish whether $p=q$ versus $p$ is at least $\epsilon$-far from $q$, in either $\ell_1$ or $\ell_2$ distance. Batu et al~\cite{BFR+:00, Batu13} gave the first sub-linear time algorithms for these problems, which matched the lower bounds of~\cite{PV11sicomp} up to a logarithmic factor in $n$, and a polynomial factor of $\epsilon.$

In this work, we present simple testers for both the $\ell_1$ and $\ell_2$ settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on $n$, and the dependence on $\epsilon$; for the $\ell_1$ testing problem we establish that the sample complexity is $\Theta(\max\{n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^2 \}).$

A Polynomial-time Approximation Scheme for Fault-tolerant Distributed Storage [abstract] [pdf]
C. Daskalakis, A. De, I. Diakonikolas, A. Moitra, R. Servedio
Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2014)
Featured in Abstract Talk. Podcast is here
We consider a problem which has received considerable attention in systems literature because of its applications to routing in delay tolerant networks and replica placement in distributed storage systems. In abstract terms the problem can be stated as follows: Given a random variable $X$ generated by a known product distribution over $\{0,1\}^n$ and a target value $0 \leq \theta \leq 1$, output a non-negative vector $w$, with $\|w\|_1 \le 1$, which maximizes the probability of the event $w \cdot X \ge \theta$. This is a challenging non-convex optimization problem for which even computing the value $\Pr[w \cdot X \ge \theta]$ of a proposed solution vector $w$ is #P-hard.
We provide an additive EPTAS for this problem which, for constant-bounded product distributions, runs in $ \mathrm{poly}(n) \cdot 2^{\mathrm{poly}(1/\epsilon)}$ time and outputs an $\epsilon$-approximately optimal solution vector $w$ for this problem. Our approach is inspired by, and extends, recent structural results from the complexity-theoretic study of linear threshold functions. Furthermore, in spite of the objective function being non-smooth, we give a unicriterion PTAS while previous work for such objective functions has typically led to a bicriterion PTAS. We believe our techniques may be applicable to get unicriterion PTAS for other non-smooth objective functions.

Learning Sums of Independent Integer Random Variables [abstract] [pdf]
C. Daskalakis, I. Diakonikolas, R. O'Donnell, R. Servedio, L-Y. Tan
Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013)
Blog post about this work: MIT theory student blog
Let $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_n$ be a sum of $n$ independent integer random variables $\mathbf{X}_i$, where each $\mathbf{X}_i$ is supported on $\{0,1,\dots,k-1\}$ but otherwise may have an arbitrary distribution (in particular the $\mathbf{X}_i$'s need not be identically distributed). How many samples are required to learn the distribution $\mathbf{S}$ to high accuracy? In this paper we show that the answer is completely independent of $n$, and moreover we give a computationally efficient algorithm which achieves this low sample complexity. More precisely, our algorithm learns any such $\mathbf{S}$ to $\epsilon$-accuracy (with respect to the total variation distance between distributions) using $\mathrm{poly}(k,1/\epsilon)$ samples, independent of $n$. Its running time is $\mathrm{poly}(k,1/\epsilon)$ in the standard word RAM model. Thus we give a broad generalization of the main result of~\cite{DDS12stoc} which gave a similar learning result for the special case $k=2$ (when the distribution~$\mathbf{S}$ is a Poisson Binomial Distribution).
Prior to this work, no nontrivial results were known for learning these distributions even in the case $k=3$. A key difficulty is that, in contrast to the case of $k = 2$, sums of independent $\{0,1,2\}$-valued random variables may behave very differently from (discretized) normal distributions, and in fact may be rather complicated --- they are not log-concave, they can be $\Theta(n)$-modal, there is no relationship between Kolmogorov distance and total variation distance for the class, etc. Nevertheless, the heart of our learning result is a new limit theorem which characterizes what the sum of an arbitrary number of arbitrary independent $\{0,1,\dots, k-1\}$-valued random variables may look like. Previous limit theorems in this setting made strong assumptions on the "shift invariance" of the random variables $\mathbf{X}_i$ in order to force a discretized normal limit. We believe that our new limit theorem, as the first result for truly arbitrary sums of independent $\{0,1,\dots,k-1\}$-valued random variables, is of independent interest.

Deterministic Approximate Counting for Degree-$2$ Polynomial Threshold Functions [abstract] [pdf]
A. De, I. Diakonikolas, R. Servedio
Manuscript, 2013. Available as ECCC technical report [link]
We give a deterministic algorithm for approximately computing the fraction of Boolean assignments that satisfy a degree-$2$ polynomial threshold function. Given a degree-$2$ input polynomial $p(x_1,\dots,x_n)$ and a parameter $\epsilon > 0$, the algorithm approximates $\Pr_{x \sim \{-1,1\}^n}[p(x) \geq 0]$ to within an additive $\pm \epsilon$ in time $\mathrm{poly}(n,2^{\mathrm{poly}(1/\epsilon)})$. Note that it is NP-hard to determine whether the above probability is nonzero, so any sort of multiplicative approximation is almost certainly impossible even for efficient randomized algorithms. This is the first deterministic algorithm for this counting problem in which the running time is polynomial in $n$ for $\epsilon= o(1)$. For "regular" polynomials $p$ (those in which no individual variable's influence is large compared to the sum of all $n$ variable influences) our algorithm runs in $\mathrm{poly}(n,1/\epsilon)$ time. The algorithm also runs in $\mathrm{poly}(n,1/\epsilon)$ time to approximate $\Pr_{x \sim \mathcal{N}(0,1)^n}[p(x) \geq 0]$ to within an additive $\pm \epsilon$, for any degree-2 polynomial $p$.
As an application of our counting result, we give a deterministic multiplicative $(1 \pm \epsilon)$-approximation algorithm to approximate the $k$-th absolute moment $\mathbf{E}_{x \sim \{-1,1\}^n}[|p(x)^k|]$ of a degree-$2$ polynomial. The algorithm runs in fixed polynomial time for any constants $k$ and $\epsilon.$

A robust Khintchine Inequality and computing optimal constants in Fourier analysis and high-dimensional geometry [abstract] [pdf]
A. De, I. Diakonikolas, R. Servedio
SIAM Journal on Discrete Mathematics, 30-2 (2016), pp. 1058-1094
Proceedings of the 40th Intl. Colloquium on Automata, Languages and Programming (ICALP 2013)
This paper makes two contributions towards determining some well-studied optimal constants in Fourier analysis of Boolean functions and high-dimensional geometry.
(1) It has been known since 1994 \cite{GL:94} that every linear threshold function has squared Fourier mass at least $1/2$ on its degree-$0$ and degree-$1$ coefficients. Denote the minimum such Fourier mass by $\mathbf{W}^{\leq 1}[\mathbf{LTF}]$, where the minimum is taken over all $n$-variable linear threshold functions and all $n \ge 0$. Benjamini, Kalai and Schramm \cite{BKS:99} have conjectured that the true value of $\mathbf{W}^{\leq 1}[\mathbf{LTF}]$ is $2/\pi$. We make progress on this conjecture by proving that $\mathbf{W}^{\leq 1}[\mathbf{LTF}] \geq 1/2 + c$ for some absolute constant $c>0$. The key ingredient in our proof is a "robust" version of the well-known Khintchine inequality in functional analysis, which we believe may be of independent interest.
(2) We give an algorithm with the following property: given any $\eta > 0$, the algorithm runs in time $2^{\mathrm{poly}(1/\eta)}$ and determines the value of $\mathbf{W}^{\leq 1}[\mathbf{LTF}]$ up to an additive error of $\pm\eta$. We give a similar $2^{{\mathrm{poly}(1/\eta)}}$-time algorithm to determine Tomaszewski's constant to within an additive error of $\pm \eta$; this is the minimum (over all origin-centered hyperplanes $H$) fraction of points in $\{-1,1\}^n$ that lie within Euclidean distance $1$ of $H$. Tomaszewski's constant is conjectured to be $1/2$; lower bounds on it have been given by Holzman and Kleitman \cite{HK92} and independently by Ben-Tal, Nemirovski and Roos \cite{BNR02}. Our algorithms combine tools from anti-concentration of sums of independent random variables, Fourier analysis, and Hermite analysis of linear threshold functions.

Learning Mixtures of Structured Distributions over Discrete Domains [abstract] [pdf]
S. Chan, I. Diakonikolas, R. Servedio, X. Sun
Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2013)
Let $\mathfrak{C}$ be a class of probability distributions over the discrete domain $[n] = \{1,\dots,n\}.$ We show that if $\mathfrak{C}$ satisfies a rather general condition -- essentially, that each distribution in $\mathfrak{C}$ can be well-approximated by a variable-width histogram with few bins -- then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of $k$ unknown distributions from $\mathfrak{C}.$ We analyze several natural types of distributions over $[n]$, including log-concave, monotone hazard rate and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems as described below. More precisely,
Log-concave distributions: We learn any mixture of $k$ log-concave distributions over $[n]$ using $k \cdot \tilde{O}(1/\epsilon^4)$ samples (independent of $n$) and running in time $\tilde{O}(k \log(n) / \epsilon^4)$ bit-operations (note that reading a single sample from $[n]$ takes $\Theta(\log n)$ bit operations). For the special case $k=1$ we give an efficient algorithm using $\tilde{O}(1/\epsilon^3)$ samples; this generalizes the main result of \cite{DDS12stoc} from the class of Poisson Binomial distributions to the much broader class of all log-concave distributions. Our upper bounds are not far from optimal since any algorithm for this learning problem requires $\Omega(k/\epsilon^{5/2})$ samples.
Monotone hazard rate (MHR) distributions: We learn any mixture of $k$ MHR distributions over $[n]$ using $O(k \log (n/\epsilon)/\epsilon^4)$ samples and running in time $\tilde{O}(k \log^2(n) / \epsilon^4)$ bit-operations. Any algorithm for this learning problem must use $\Omega(k \log(n)/\epsilon^3)$ samples.
Unimodal distributions: We give an algorithm that learns any mixture of $k$ unimodal distributions over $[n]$ using $O(k \log (n)/\epsilon^{4})$ samples and running in time $\tilde{O}(k \log^2(n) / \epsilon^{4})$ bit-operations. Any algorithm for this problem must use $\Omega(k \log(n)/\epsilon^3)$ samples.

Testing $k$-modal Distributions: Optimal Algorithms via Reductions [abstract] [pdf]
C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, P. Valiant
Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2013)
We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the $L_1$ (total variation) distance between two $k$-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$. More precisely, we consider the following four problems: given sample access to an unknown $k$-modal distribution $p$,
Testing identity to a known or unknown distribution:
(1) Determine whether $p = q$ (for an explicitly given $k$-modal distribution $q$) versus $p$ is $\epsilon$-far from $q$;
(2) Determine whether $p=q$ (where $q$ is available via sample access) versus $p$ is $\epsilon$-far from $q$;
Estimating $L_1$ distance (``tolerant testing'') against a known or unknown distribution:
(3) Approximate $d_{TV}(p,q)$ to within additive $\epsilon$ where $q$ is an explicitly given $k$-modal distribution $q$;
(4)Approximate $d_{TV}(p,q)$ to within additive $\epsilon$ where $q$ is available via sample access.
For each of these four problems we give sub-logarithmic sample algorithms, that we show are tight up to additive $\mathrm{poly}(k)$ and multiplicative $\mathrm{polylog}\log n+\mathrm{polylog} k$ factors. Thus our bounds significantly improve the previous results of \cite{BKR:04}, which were for testing identity of distributions (items (1) and (2) above) in the special cases $k=0$ (monotone distributions) and $k=1$ (unimodal distributions) and required $O((\log n)^3)$ samples.

An Optimal Algorithm for the Efficient Approximation of Convex Pareto Curves
I. Diakonikolas, M. Yannakakis
Manuscript, 2012

Efficiency-Revenue Tradeoffs in Auctions [abstract] [pdf]
I. Diakonikolas, C.H. Papadimitriou, G. Pierrakos, Y. Singer
Proceedings of the 39th Intl. Colloquium on Automata, Languages and Programming (ICALP 2012)
When agents with independent priors bid for a single item, Myerson's optimal auction maximizes expected revenue, whereas Vickrey's second-price auction optimizes social welfare. We address the natural question of trade-offs between the two criteria, that is, auctions that optimize, say, revenue under the constraint that the welfare is above a given level. If one allows for randomized mechanisms, it is easy to see that there are polynomial-time mechanisms that achieve any point in the trade-off (the Pareto curve) between revenue and welfare. We investigate whether one can achieve the same guarantees using deterministic mechanisms. We provide a negative answer to this question by showing that this is a (weakly) NP-hard problem. On the positive side, we provide polynomial-time deterministic mechanisms that approximate with arbitrary precision any point of the trade-off between these two fundamental objectives for the case of two bidders, even when the valuations are correlated arbitrarily. The major problem left open by our work is whether there is such an algorithm for three or more bidders with independent valuation distributions.

The Inverse Shapley Value Problem [abstract] [pdf]
A. De, I. Diakonikolas, R. Servedio
Games and Economic Behavior, accepted with minor revisions
Proceedings of the 39th Intl. Colloquium on Automata, Languages and Programming (ICALP 2012)
For $f$ a weighted voting scheme used by $n$ voters to choose between two candidates, the $n$ Shapley-Shubik Indices (or Shapley values) of $f$ provide a measure of how much control each voter can exert over the overall outcome of the vote. Shapley-Shubik indices were introduced by Lloyd Shapley and Martin Shubik in 1954 \cite{SS54} and are widely studied in social choice theory as a measure of the "influence" of voters. The Inverse Shapley Value Problem is the problem of designing a weighted voting scheme which (approximately) achieves a desired input vector of values for the Shapley-Shubik indices. Despite much interest in this problem no provably correct and efficient algorithm was known prior to our work.
We give the first efficient algorithm with provable performance guarantees for the Inverse Shapley Value Problem. For any constant $\epsilon > 0$ our algorithm runs in fixed poly$(n)$ time (the degree of the polynomial is independent of $\epsilon$) and has the following performance guarantee: given as input a vector of desired Shapley values, if any "reasonable" weighted voting scheme (roughly, one in which the threshold is not too skewed) approximately matches the desired vector of values to within some small error, then our algorithm explicitly outputs a weighted voting scheme that achieves this vector of Shapley values to within error $\epsilon.$ If there is a "reasonable" voting scheme in which all voting weights are integers at most $\mathrm{poly}(n)$ that approximately achieves the desired Shapley values, then our algorithm runs in time $\mathrm{poly}(n)$ and outputs a weighted voting scheme that achieves the target vector of Shapley values to within error $\epsilon=n^{-1/8}.$

Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces [abstract] [pdf]
A. De, I. Diakonikolas, V. Feldman, R. Servedio
Journal of the ACM, 61(2), 2014. Invited to Theory of Computing special issue on Analysis of Boolean functions (declined)
Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC 2012)
IBM Research 2014 Pat Goldberg Math/CS/EE Best Paper Award
The Chow parameters of a Boolean function $f: \{-1,1\}^n \to \{-1,1\}$ are its $n+1$ degree-$0$ and degree-$1$ Fourier coefficients. It has been known since 1961 \cite{Chow:61, Tannenbaum:61} that the (exact values of the) Chow parameters of any linear threshold function $f$ uniquely specify $f$ within the space of all Boolean functions, but until recently \cite{OS11:chow} nothing was known about efficient algorithms for reconstructing $f$ (exactly or approximately) from exact or approximate values of its Chow parameters. We refer to this reconstruction problem as the Chow Parameters Problem.
Our main result is a new algorithm for the Chow Parameters Problem which, given (sufficiently accurate approximations to) the Chow parameters of any linear threshold function $f$, runs in time $\tilde{O}(n^2)\cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$ and with high probability outputs a representation of an LTF $f'$ that is $\epsilon$-close to $f$ in Hamming distance. The only previous algorithm \cite{OS11:chow} had running time $\mathrm{poly}(n) \cdot 2^{2^{\tilde{O}(1/\epsilon^2)}}.$
As a byproduct of our approach, we show that for any linear threshold function $f$ over $\{-1,1\}^n$, there is a linear threshold function $f'$ which is $\epsilon$-close to $f$ and has all weights that are integers of magnitude at most $\sqrt{n} \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$. This significantly improves the previous best result of~\cite{DiakonikolasServedio:09} which gave a $\mathrm{poly}(n) \cdot 2^{\tilde{O}(1/\epsilon^{2/3})}$ weight bound, and is close to the known lower bound of $\max\{\sqrt{n},$ $(1/\epsilon)^{\Omega(\log \log (1/\epsilon))}\}$ \cite{Goldberg:06b,Servedio:07cc}. Our techniques also yield improved algorithms for related problems in learning theory.
In addition to being significantly stronger than previous work, our results are obtained using conceptually simpler proofs. The two main ingredients underlying our results are (1) a new structural result showing that for $f$ any linear threshold function and $g$ any bounded function, if the Chow parameters of $f$ are close to the Chow parameters of $g$ then $f$ is close to $g$; (2) a new boosting-like algorithm that given approximations to the Chow parameters of a linear threshold function outputs a bounded function whose Chow parameters are close to those of $f$.

Learning Poisson Binomial distributions [abstract] [pdf]
C. Daskalakis, I. Diakonikolas, R. Servedio
Invited to Algorithmica special issue on New Theoretical Challenges in Machine Learning
Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC 2012)
We consider a basic problem in unsupervised learning: learning an unknown Poisson Binomial Distribution. A Poisson Binomial Distribution (PBD) over $\{0,1,\dots,n\}$ is the distribution of a sum of $n$ independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 \cite{Poisson:37} and are a natural $n$-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal.
We essentially settle the complexity of the learning problem for this basic class of distributions. As our main result we give a highly efficient algorithm which learns to $\epsilon$-accuracy using $\tilde{O}(1/\epsilon^3)$ samples independent of $n$. The running time of the algorithm is quasilinear in the size of its input data. This is nearly optimal since any algorithm must use $\Omega(1/\epsilon^2)$ samples. We also give positive and negative results for some extensions of this learning problem.

Learning $k$-modal distributions via testing [abstract] [pdf]
C. Daskalakis, I. Diakonikolas, R. Servedio
Theory of Computing, 10 (20), 535-570 (2014)
Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2012)
Oded Goldreich's Choices #72
A $k$-modal probability distribution over the domain $\{1,...,n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k=0$) and unimodal ($k=1$) probability distributions, which have been intensively studied in probability theory and statistics. In this paper we consider the problem of learning an unknown $k$-modal distribution. The learning algorithm is given access to independent samples drawn from the $k$-modal distribution $p$, and must output a hypothesis distribution $\hat{p}$ such that with high probability the total variation distance between $p$ and $\hat{p}$ is at most $\epsilon.$
We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k,\log(n),1/\epsilon)$. For $k \leq \tilde{O}(\sqrt{\log n})$, the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/\epsilon))$ factor) to being information-theoretically optimal. Prior to this work computationally efficient algorithms were known only for the cases $k=0,1$ \cite{Birge:87b,Birge:97}.
A novel feature of our approach is that our learning algorithm crucially uses a new property testing algorithm as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the $k$-modal distribution into $k$ (near)-monotone distributions, which are easier to learn.

Noise Stable Halfspaces are Close to Very Small Juntas [abstract] [pdf]
I. Diakonikolas, R. Jaiswal, R. Servedio, L.-Y.Tan, A. Wan
Chicago Journal of Theoretical Computer Science, 2015
Bourgain~\cite{Bourgain:02} showed that any noise stable Boolean function $f$ can be well-approximated by a junta. In this note we give an exponential sharpening of the parameters of Bourgain's result under the additional assumption that $f$ is a halfspace.

Supervised Design Space Exploration by Compositional Approximation of Pareto sets [abstract] [pdf]
H.-Y. Liu, I. Diakonikolas, M. Petracca, L.P. Carloni
Proceedings of the 48th Design Automation Conference (DAC 2011)
Technology scaling allows the integration of billions of transistors on the same die but CAD tools struggle in keeping up with the increasing design complexity. Design productivity for multi-core SoCs increasingly depends on creating and maintaining reusable components and hierarchically combining them to form larger composite cores. Characterizing such composite cores with respect to their power/performance trade-offs is critical for design reuse across various products and relies heavily on synthesis tools.
We present $\mathrm{CAPS}$, an online adaptive algorithm that efficiently explores the design space of any given core and returns an accurate characterization of its implementation trade-offs in terms of an approximate Pareto set. It does so by supervising the order of the time-consuming logic-synthesis runs on the core's components. Our algorithm can provably achieve the desired precision on the approximation in the shortest possible time, without having any a-priori information on any component. We also show that, in practice, $\mathrm{CAPS}$ works even better than what is guaranteed by the theory.

Disjoint-Path Facility Location: Theory and Practice [abstract] [pdf]
L. Breslau, I. Diakonikolas, N. Duffield, Y.Gu, M.T. Hajiaghayi, D.S. Johnson, H. Karloff, M. Resende, S.Sen
Proceedings of the 13th Workshop on Algorithm Engineering and Experiments (ALENEX 2011)
Journal version in Operations Research, 2019 [arxiv]
This paper is a theoretical and experimental study of two related facility location problems that emanated from networking. Suppose we are given a network modeled as a directed graph $G = (V,A)$, together with (not-necessarily-disjoint) subsets $C$ and $F$ of $V$ , where $C$ is a set of customer locations and $F$ is a set of potential facility locations (and typically $C\subseteq F$). Our goal is to find a minimum sized subset $F' \subseteq F$ such that for every customer $c \in C$ there are two locations $f_1, f_2 \in F'$ such that traffic from $c$ to $f_1$ and to $f_2$ is routed on disjoint paths (usually shortest paths) under the network's routing protocols. Although we prove that this problem is impossible to approximate in the worst case even to within a factor of $2^{\log^{1-\epsilon} n}$ for any $\epsilon>0$ (assuming no NP-complete language can be solved in quasi-polynomial time), we show that the situation is much better in practice. We propose three algorithms that build solutions and determine lower bounds on the optimum solution, and evaluate them on several large real ISP topologies and on synthetic networks designed to reflect real-world LAN/WAN network structure. Our main algorithms are (1) an algorithm that performs multiple runs of a straightforward randomized greedy heuristic and returns the best result found, (2) a genetic algorithm that uses the greedy algorithm as a subroutine, and (3) a new "Double Hitting Set" algorithm. All three approaches perform surprising well, although, in practice, the most cost-effective approach is the multirun greedy algorithm. This yields results that average within 0.7% of optimal for our synthetic instances and within 2.9% for our real-world instances, excluding the largest (and most realistic) one. For the latter instance, the other two algorithms come into their own, finding solutions that are more than three times better than those of the multi-start greedy approach. In terms of our motivating monitoring application, where every customer location can be a facility location, the results are even better. Here the above Double Hitting Set solution is 90% better than the default solution which places a monitor at each customer location. Our results also show that, on average for our real-world instances, we could save an additional 18% by choosing the (shortest path) routes ourselves, rather than taking the simpler approach of relying on the network to choose them for us.

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions [abstract] [pdf]
I. Diakonikolas, R. O'Donnell, R. Servedio, Y.Wu
Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2011)
Hardness results for maximum agreement problems have close connections to hardness results for proper learning in computational learning theory. In this paper we prove two hardness results for the problem of finding a low degree polynomial threshold function (PTF) which has the maximum possible agreement with a given set of labeled examples in $\mathbf{R}^n \times \{-1,1\}.$ We prove that for any constants $d\geq 1, \epsilon > 0$,
(1) Assuming the Unique Games Conjecture, no polynomial-time algorithm can find a degree-$d$ PTF that is consistent with a $(1/2 + \epsilon)$ fraction of a given set of labeled examples in $\mathbf{R}^n \times \{-1,1\}$, even if there exists a degree-$d$ PTF that is consistent with a $1-\epsilon$ fraction of the examples.
(2) It is NP-hard to find a degree-$2$ PTF that is consistent with a $(1/2 + \epsilon)$ fraction of a given set of labeled examples in $\mathbf{R}^n \times \{-1,1\}$, even if there exists a halfspace (degree-$1$ PTF) that is consistent with a $1 - \epsilon$ fraction of the examples.
These results immediately imply the following hardness of learning results: (i) Assuming the Unique Games Conjecture, there is no better-than-trivial proper learning algorithm that agnostically learns degree-$d$ PTFs under arbitrary distributions; (ii) There is no better-than-trivial learning algorithm that outputs degree-$2$ PTFs and agnostically learns halfspaces (i.e., degree-$1$ PTFs) under arbitrary distributions.

Bounded Independence Fools Degree-$2$ Threshold Functions [abstract] [pdf]
I. Diakonikolas, D. Kane, J. Nelson
Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010)
Let $x$ be a random vector coming from any $k$-wise independent distribution over $\{-1,1\}^n$. For an $n$-variate degree-$2$ polynomial $p$, we prove that $\mathbf{E}[\mathrm{sgn}(p(x))]$ is determined up to an additive $\epsilon$ for $k = \mathrm{poly}(1/\epsilon)$. This gives a large class of explicit pseudo-random generators against such functions and answers an open question of Diakonikolas et al. (FOCS 2009).
In the process, we develop a novel analytic technique we dub multivariate FT-mollification. This provides a generic tool to approximate bounded (multivariate) functions by low-degree polynomials (with respect to several different notions of approximation). A univariate version of the method was introduced by Kane et al. (SODA 2010) in the context of streaming algorithms. In this work, we refine it and generalize it to the multivariate setting. We believe that our technique is of independent mathematical interest. To illustrate its generality, we note that it implies a multidimensional generalization of Jackson's classical result in approximation theory due to (Newman and Shapiro, 1963).
To obtain our main result, we combine the FT-mollification technique with several linear algebraic and probabilistic tools. These include the invariance principle of of Mossell, O'Donnell and Oleszkiewicz, anti-concentration bounds for low-degree polynomials, an appropriate decomposition of degree-$2$ polynomials, and a generalized hyper-contractive inequality for quadratic forms which takes the operator norm of the associated matrix into account. Our analysis is quite modular; it readily adapts to show that intersections of halfspaces and degree-$2$ threshold functions are fooled by bounded independence. From this it follows that $\Omega(1/\epsilon^2)$-wise independence derandomizes the Goemans-Williamson hyperplane rounding scheme.
Our techniques unify, simplify, and in some cases improve several recent results in the literature concerning threshold functions. For the case of "regular" halfspaces we give a simple proof of an optimal independence bound of $\Theta(1/\epsilon^2)$, improving upon Diakonikolas et al. (FOCS 2009) by polylogarithmic factors. This yields the first optimal derandomization of the Berry-Esseen theorem and -- combined with the results of Kalai et al. (FOCS 2005) -- implies a faster algorithm for the problem of agnostically learning halfspaces.

Average Sensitivity and Noise Sensitivity of Polynomial Threshold Functions [abstract] [pdf]
I. Diakonikolas, P. Raghavendra, R. Servedio, L.-Y. Tan
SIAM Journal on Computing, 43(1), 231-253 (2014)
Proceedings of the 42nd Annual ACM Symposium on Theory of Computing (STOC 2010)
(Conference version merged with this paper by Harsha, Klivans and Meka)
We give the first non-trivial upper bounds on the Boolean average sensitivity and noise sensitivity of degree-$d$ polynomial threshold functions (PTFs). Our bound on the Boolean average sensitivity of PTFs represents the first progress towards the resolution of a conjecture of Gotsman and Linial \cite{GL:94}, which states that the symmetric function slicing the middle $d$ layers of the Boolean hypercube has the highest average sensitivity of all degree-$d$ PTFs. Via the $L_1$ polynomial regression algorithm of Kalai et al. \cite{KKMS:08}, our bound on Boolean noise sensitivity yields the first polynomial-time agnostic learning algorithm for the broad class of constant-degree PTFs under the uniform distribution.
To obtain our bound on the Boolean average sensitivity of PTFs, we generalize the "critical-index" machinery of \cite{Servedio:07cc} (which in that work applies to halfspaces, i.e., degree-$1$ PTFs) to general PTFs. Together with the "invariance principle" of \cite{MOO10}, this allows us to essentially reduce the Boolean setting to the Gaussian setting. The main ingredients used to obtain our bound in the Gaussian setting are tail bounds and anti-concentration bounds on low-degree polynomials in Gaussian random variables \cite{Janson:97,CW:01}. Our bound on Boolean noise sensitivity is achieved via a simple reduction from upper bounds on average sensitivity of Boolean PTFs to corresponding bounds on noise sensitivity.

A Regularity Lemma, and Low-weight Approximators, for low-degree Polynomial Threshold Functions [abstract] [pdf]
I. Diakonikolas, R. Servedio, L.-Y. Tan, A. Wan
Theory of Computing, 10(2), 27-53 (2014)
Proceedings of the 25th Annual IEEE Conference on Computational Complexity (CCC 2010)

How Good is the Chord Algorithm? [abstract] [pdf]
C. Daskalakis, I. Diakonikolas, M. Yannakakis
SIAM Journal on Computing, 45(3), pp. 811-858, 2016
Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010)

Bounded Independence Fools Halfspaces [abstract] [pdf]
I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, E. Viola
SIAM Journal on Computing, 39(8), 3441-3462 (2010)
Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009)

Improved Approximation of Linear Threshold Functions [abstract] [pdf]
I. Diakonikolas, R. Servedio
Computational Complexity, 22(3), 623-677 (2013)
Proceedings of the 24th Annual IEEE Conference on Computational Complexity (CCC 2009)

Efficiently Testing Sparse $GF(2)$ Polynomials [abstract] [pdf]
I. Diakonikolas, H. Lee, K. Matulef, R. Servedio, A. Wan
Algorithmica, 61(3), 580-605 (2011)
Proceedings of the 35th Intl. Colloquium on Automata, Languages and Programming (ICALP 2008)

Succinct Approximate Convex Pareto Curves [abstract] [pdf]
I. Diakonikolas, M. Yannakakis
Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2008)

Testing for Concise Representations [abstract] [pdf]
I. Diakonikolas, H. Lee, K. Matulef, K. Onak, R. Rubinfeld, R. Servedio, A. Wan
Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007)
Oded Goldreich's Choices #39

Small Approximate Pareto Sets for Biobjective Shortest Paths and Other Problems [abstract] [pdf]
I. Diakonikolas, M. Yannakakis
SIAM Journal on Computing, 39(4), 1340-1371 (2009)
Proceedings of the 10th Intl. Workshop on Approximation, Randomization, and Combinatorial Optimization (APPROX 2007)
Honorable Mention, George Nicholson student paper competition INFORMS society, 2009

Ilias Diakonikolas

Robust Statistics Book

Expository Articles

Recent Activities

Research Publications [chronologically][by topic]

Thesis

Professional Activities

Recent Teaching

Advising